Skewed finite hashing function

ABSTRACT

A portion of the global memory of a multiprocessing computer system is allocated to each node, called local memory space. Data from a remote node may be copied to the local memory space of a node such that accesses to the data may be performed locally rather than globally. The copied data is referred to as a shadow page. The global address of the data is translated to a local physical address for the node to which the data is copied. To reduce the size of the translation tables for converting between global addresses and local physical addresses, the pages to which shadow copies may be stored, and the global addresses that may be converted to local physical addresses, may be restricted. Multiple pages of local memory space may be allocated to one entry of a local physical address to global address (LPA2GA) table. When a page is allocated to store shadow pages, an entry in the LPA2GA table associated with that page is marked as unavailable. Accordingly, new translations may not be stored to that entry of the LPA2GA table and other pages associated with that entry may not be allocated to store shadow pages. In a similar manner, multiple pages of the global address space are mapped to an entry in a global address to local physical address (GA2LPA) translation table. When data corresponding to a page within the global address space is stored as a shadow page, the entry associated with the global address is marked as unavailable. Accordingly, other pages associated with that entry of the GA2LPA table may not be stored as shadow pages because the entry is not available. In that case, the local copy of the data is not stored and the node must access the data globally. To decrease the probability that an entry is not available for a page, the GA2LPA table may be implemented as a set associative table. To further increase the availability of entries in the GA2LPA table, the GA2LPA table may be implemented as a skewed-associative cache with an insertion algorithm that realigns the translations in the table to maximize the utilization of the available entries.

CROSS REFERENCE TO RELATED PATENT APPLICATIONS

[0001] This patent application is a continuation-in-part of copending, commonly assigned patent application Ser. No. 08/924,385, "Hierarchical Computer System" by Erik E. Hagersten, filed Sep. 5, 1997, the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

[0002] 1. Field of the Invention

[0003] This invention relates to the field of translation tables and, more particularly, to skewed hashing functions employed within translation tables of multiprocessor computer systems.

[0004] 2. Description of the Relevant Art

[0005] Multiprocessing computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.

[0006] A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors connected through a cache hierarchy to a shared bus. Additionally connected to the bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).

[0007] Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches which are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or "snooped") against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.

[0008] Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus is capable of a peak bandwidth (e.g. a number of bytes/second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed the available bus bandwidth.

[0009] Additionally, adding more processors to a shared bus increases the capacitive loading on the bus and may even cause the physical length of the bus to be increased. The increased capacitive loading and extended bus length increase the delay in propagating a signal across the bus. Due to the increased propagation delay, transactions may take longer to perform. Therefore, the peak bandwidth of the bus may decrease as more processors are added.

[0010] These problems are further magnified by the continued increase in operating frequency and performance of processors. The increased performance enabled by the higher frequencies and more advanced processor microarchitectures results in higher bandwidth requirements than previous processor generations, even for the same number of processors. Therefore, buses which previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing the higher performance processors.

[0011] Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.

[0012] Distributed shared memory systems are scalable, overcoming the limitations of the shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.

[0013] Distributed shared memory systems may employ local and global address spaces. A portion of the global address space may be assigned to each node within the distributed shared memory system. In some distributed shared memory systems, data corresponding to the addresses of remote nodes may be copied to a requesting node's shared memory such that future accesses to that data may be performed via local transactions rather than global transactions. The copied data is referred to as shadow pages. In such systems, CPUs local to the node may use the local physical addresses assigned to the shadow pages. Address translation tables are provided to translate between the global address and the local physical address assigned to the shadow pages. In distributed shared memory systems with large address spaces, the translation tables used to translate between global addresses and local physical addresses can become very large. For example, in a distributed shared memory system with four nodes and 1M pages per node, a global address to local physical address translation table may include 4M entries. In some systems, the access time of such a large translation table may add unacceptable delay to a memory transaction.

[0014] To reduce the latency and the implementation cost associated with a global address to local physical address translation, some distributed shared memory systems employ a cache for storing the most recently accessed translations. The cache reduces the propagation delay for translations stored in the cache. Cache misses, however, add significant latency, and the cache adds significant complexity to the translation table.

[0015] To decrease the number of cache misses, the size of the cache may be increased or the cache may be made set associative. Associative caches trade access time for utilization. In other words, the higher the associativity of a cache, the longer the access time. For example, a fully associative cache may approach 100% utilization. However, the access time of a fully associative cache is relatively long because each entry in the cache may be queried for the desired data. Alternatively, a direct mapped cache has a relatively short access time (only one entry is accessed), but the utilization of a direct mapped cache may be relatively low. A look-up table with high utilization and short access times is desirable.

SUMMARY OF THE INVENTION

[0016] The problems outlined above are in large part solved by a skewed-associative table that implements an insertion algorithm to maximize the utilization of the table. In one embodiment, an input address is converted to two look-up addresses using one or more index functions. The look-up addresses address a primary entry associated with the input address and a secondary entry associated with the input address. Only the primary entry and the secondary entry need to be accessed during a table look-up. An insertion algorithm maximizes the utilization of the table by realigning the data stored in the table to make an entry available for new data. For example, if the primary entry and secondary entry associated with an input address are occupied by other data, the insertion algorithm will move the data stored in either the primary entry or the secondary entry to an alternative entry for that data. By moving the data to an alternative entry, the entry is made available to store the new data. If the alternative entries for the data stored in the primary entry and the secondary entry are themselves unavailable, the data occupying those alternative entries is moved to its own alternative entries. The data in the primary entry or secondary entry is then stored to its alternative entry, and the entry is made available to store the new data. Accordingly, the insertion algorithm increases the utilization of the table to approach the utilization of a fully associative table while the access time of the table is similar to a two-way set-associative table. It is noted that the present invention applies to caches as well as tables.

[0017] Broadly speaking, the present invention contemplates a look-up table configured to store and output data corresponding to input addresses. The look-up table includes a plurality of entries for storing the data and a look-up address circuit. The look-up address circuit is configured to receive the input address and includes a first index function circuit and a second index function circuit. The first index function circuit is configured to convert a first input address to a primary look-up address that corresponds to the first input address, wherein a primary entry of the plurality of entries is addressed by the primary look-up address. The second index function circuit is configured to convert the first input address to a secondary look-up address that corresponds to the first input address, wherein a secondary entry of the plurality of entries is addressed by the secondary look-up address. The look-up table is configured to store a first datum to the primary entry if the primary entry is available and to store the first datum to the secondary entry if the primary entry is unavailable. If the primary entry and the secondary entry are unavailable, the look-up table is configured to move a second datum stored in the primary entry (or secondary entry) to an alternate entry for the second datum and to store the first datum to the primary entry (or secondary entry).

[0018] The present invention further contemplates a method of storing and retrieving data in a look-up table, wherein the data corresponds to input addresses and each input address corresponds to a primary entry and a secondary entry of the look-up table, comprising: if a primary entry corresponding to a first input address is available, storing a first datum to the primary entry; if the primary entry is unavailable, storing the first datum to a secondary entry corresponding to the first input address; and if the primary entry and the secondary entry are unavailable, moving a second datum stored in the primary entry (or secondary entry) to an alternate entry of the second datum and storing the first datum to the primary entry (or secondary entry).
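
For purposes of illustration only, the following C sketch models such a look-up table in software. The table size, the two index functions, and the entry format shown here are hypothetical assumptions for the sketch and are not taken from the embodiments described below; realignment of occupied entries is sketched separately in connection with the insertion algorithm later in the description.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024u            /* number of entries (hypothetical) */

    typedef struct {
        bool     valid;                 /* entry currently holds a datum                */
        uint32_t key;                   /* input address (e.g. a global address)        */
        uint32_t value;                 /* stored datum (e.g. a local physical address) */
    } entry_t;

    static entry_t table[TABLE_SIZE];

    /* Two different ("skewing") index functions: each input address maps to
     * one primary entry and one secondary entry.                            */
    static uint32_t index_primary(uint32_t key)   { return key % TABLE_SIZE; }
    static uint32_t index_secondary(uint32_t key) { return ((key >> 10) ^ key) % TABLE_SIZE; }

    /* Look-up: only the primary and the secondary entry are examined.       */
    static bool lookup(uint32_t key, uint32_t *value)
    {
        entry_t *p = &table[index_primary(key)];
        entry_t *s = &table[index_secondary(key)];

        if (p->valid && p->key == key) { *value = p->value; return true; }
        if (s->valid && s->key == key) { *value = s->value; return true; }
        return false;                   /* miss: fall back to the full table in memory */
    }

    /* Simple insertion: use the primary entry if available, otherwise the
     * secondary entry; if both are occupied, realignment is required.       */
    static bool insert_simple(uint32_t key, uint32_t value)
    {
        entry_t *p = &table[index_primary(key)];
        entry_t *s = &table[index_secondary(key)];
        entry_t *target = !p->valid ? p : (!s->valid ? s : NULL);

        if (target == NULL)
            return false;               /* both entries occupied: realignment needed */
        target->valid = true;
        target->key   = key;
        target->value = value;
        return true;
    }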

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:

[0020] FIG. 1 is a block diagram of a multiprocessor computer system.

[0021] FIG. 1A is a conceptualized block diagram depicting a non-uniform memory architecture supported by one embodiment of the computer system shown in FIG. 1.

[0022] FIG. 1B is a conceptualized block diagram depicting a cache-only memory architecture supported by one embodiment of the computer system shown in FIG. 1.

[0023] FIG. 2 is a block diagram of one embodiment of a symmetric multiprocessing node depicted in FIG. 1.

[0024] FIG. 2A is an exemplary directory entry stored in one embodiment of a directory depicted in FIG. 2.

[0025] FIG. 3 is a block diagram of one embodiment of a system interface shown in FIG. 1.

[0026] FIG. 4 is a mapping of a physical address space and a logical address space of a four-node multiprocessing computer system according to one embodiment of the present invention.

[0027] FIG. 5 illustrates a local physical address according to one embodiment of the present invention.

[0028] FIG. 6 illustrates a directory entry according to one embodiment of the present invention.

[0029] FIG. 7 is a block diagram illustrating a list of free memory and a list of CMR memory.

[0030] FIG. 8 is a block diagram illustrating an organization of a local memory and the mapping of pages within the local memory to entries in a local physical address to global address translation table.

[0031] FIG. 9 is a diagram illustrating the translation of a local physical address to a global address according to one embodiment of the present invention.

[0032] FIG. 10 illustrates an entry of a local physical address to global address translation table according to one embodiment of the present invention.

[0033] FIG. 11 is a block diagram illustrating an organization of a global address to local physical address translation table according to one embodiment of the present invention.

[0034] FIG. 12A is a block diagram illustrating an alternative organization of a global address to local physical address translation table according to one embodiment of the present invention.

[0035] FIG. 12B is a diagram illustrating an example of realigning table entries within a global address to local physical address translation table according to one embodiment of the present invention.

[0036] FIG. 12C is a diagram illustrating another example of realigning table entries within a global address to local physical address translation table according to one embodiment of the present invention.

[0037] FIG. 13 is a diagram illustrating the translation of a global address to a local physical address according to one embodiment of the present invention.

[0038] FIG. 14A is a flowchart illustrating the allocation of entries in a global address to local physical address table according to one embodiment of the present invention.

[0039] FIG. 14B is a flowchart illustrating the allocation of coherent replication memory according to one embodiment of the present invention.

[0040] FIG. 14C is a flowchart illustrating the realignment of entries in a global address to local physical address table according to one embodiment of the present invention.

[0041] While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION OF THE INVENTION

[0042] Turning now to FIG. 1, a block diagram of one embodiment of a multiprocessing computer system 10 is shown. Computer system 10 includes multiple SMP nodes 12A-12D interconnected by a point-to-point network 14. Elements referred to herein with a particular reference number followed by a letter will be collectively referred to by the reference number alone. For example, SMP nodes 12A-12D will be collectively referred to as SMP nodes 12. In the embodiment shown, each SMP node 12 includes multiple processors, external caches, an SMP bus, a memory, and a system interface. For example, SMP node 12A is configured with multiple processors including processors 16A-16B. The processors 16 are connected to external caches 18, which are further coupled to an SMP bus 20. Additionally, a memory 22 and a system interface 24 are coupled to SMP bus 20. Still further, one or more input/output (I/O) interfaces 26 may be coupled to SMP bus 20. I/O interfaces 26 are used to interface to peripheral devices such as serial and parallel ports, disk drives, modems, printers, etc. Other SMP nodes 12B-12D may be configured similarly.

[0043] Generally speaking, the memory, or physical address space, of a computer system is distributed among SMP nodes 12A-12D. The memory assigned to a node is referred to as the local memory of that node. Typically, accesses to a node's local memory are local transactions and accesses to another node's local memory are global transactions. In one embodiment, a node may store a shadow copy of data from another node's local memory (the node which stores the original data is referred to as the home node). Accordingly, accesses to the shadow copy of data may be performed locally rather than accessing the data from the home node. When a shadow copy of data is stored to a local node, the data is assigned an address within the local physical address space of the local node. Although data accesses to a shadow copy may be local, coherency operations are typically still global. For example, if a local node attempts to write to a shadow copy without sufficient access rights, a global coherency operation, such as a write invalidation operation, is performed to obtain write access rights to the data. When a coherency operation is performed, the local physical address assigned to the shadow copy is translated to the global address of the data using the LPA2GA table.

[0044] A coherency operation from another node arriving at a node with a shadow copy of the requested page will need the reverse translation from a global address to a local physical address. Generally speaking, a global address to local physical address translation (GA2LPA) table may be implemented as a two-way set-associative cache that uses an insertion algorithm to maximize the utilization of the table. During a table look-up, only two entries need to be accessed, which keeps the access time of the table to a minimum. The insertion algorithm, however, maximizes the utilization of the table by realigning data stored in the table if an entry for new data is not available.

[0045] In one embodiment, the insertion algorithm of the GA2LPA translation table is implemented in software. The insertion algorithm first determines if the primary entry is available. If the primary entry is available, the translation is stored to the primary entry. If the primary entry is unavailable, the insertion algorithm determines if the secondary entry is available. If the secondary entry is available, the translation is stored to the secondary entry. If both the primary and secondary entries are unavailable, the insertion algorithm makes an entry available for the new translation by realigning the translations stored in the table. First, the translations that occupy the primary and secondary entries both have alternative locations. Accordingly, one of those translations may be moved to its alternate entry, which makes an entry available for storing the new translation. If the alternative location of the translation stored in the primary (or secondary) entry is unavailable, the translation occupying that alternative entry may itself be moved to its own alternate entry. This makes the alternative entry of the translation stored in the primary (or secondary) entry available, and the translation is moved to its alternate entry. The primary (or secondary) entry is then available for storing the new translation. Several iterations of the above methodology may be repeated before an entry becomes available for the new translation.
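
The realignment step of such an insertion algorithm might be sketched in C as follows. The bounded recursion depth, the index functions, and the helper names are illustrative assumptions for this sketch rather than part of the described embodiment.

    #include <stdbool.h>
    #include <stdint.h>

    #define TABLE_SIZE 1024u
    #define MAX_REALIGN_DEPTH 4         /* bound on realignment iterations (assumption) */

    typedef struct {
        bool     valid;
        uint32_t key;                   /* e.g. a global address          */
        uint32_t value;                 /* e.g. a local physical address  */
    } entry_t;

    static entry_t table[TABLE_SIZE];

    static uint32_t index_primary(uint32_t key)   { return key % TABLE_SIZE; }
    static uint32_t index_secondary(uint32_t key) { return ((key >> 10) ^ key) % TABLE_SIZE; }

    /* The alternate entry of a translation is whichever of its two possible
     * entries it does not currently occupy.                                 */
    static uint32_t alternate_index(const entry_t *e, uint32_t current)
    {
        uint32_t p = index_primary(e->key);
        return (p == current) ? index_secondary(e->key) : p;
    }

    /* Try to free table[idx] by moving its occupant to the occupant's own
     * alternate entry, recursing a bounded number of times if that alternate
     * entry is itself occupied.                                              */
    static bool make_available(uint32_t idx, int depth)
    {
        if (!table[idx].valid)
            return true;
        if (depth == 0)
            return false;

        uint32_t alt = alternate_index(&table[idx], idx);
        if (!make_available(alt, depth - 1))
            return false;

        table[alt] = table[idx];        /* move the displaced translation */
        table[idx].valid = false;
        return true;
    }

    /* Insertion algorithm: primary entry, then secondary entry, then realign. */
    static bool insert(uint32_t key, uint32_t value)
    {
        uint32_t pi = index_primary(key);
        uint32_t si = index_secondary(key);
        uint32_t target;

        if (!table[pi].valid)                            target = pi;
        else if (!table[si].valid)                       target = si;
        else if (make_available(pi, MAX_REALIGN_DEPTH))  target = pi;  /* realign primary   */
        else if (make_available(si, MAX_REALIGN_DEPTH))  target = si;  /* realign secondary */
        else return false;              /* no entry could be made available */

        table[target].valid = true;
        table[target].key   = key;
        table[target].value = value;
        return true;
    }

Bounding the recursion depth keeps the software cost of an insertion predictable; a failed insertion then corresponds to the case described above in which the local copy is simply not stored and the data is accessed globally.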

[0046] As used herein, a memory operation is an operation causing the transfer of data from a source to a destination. The source and/or destination may be storage locations within the initiator, or may be storage locations within memory. When a source or destination is a storage location within memory, the source or destination is specified via an address conveyed with the memory operation. Memory operations may be read or write operations. A read operation causes transfer of data from a source outside of the initiator to a destination within the initiator. Conversely, a write operation causes transfer of data from a source within the initiator to a destination outside of the initiator. In the computer system shown in FIG. 1, a memory operation may include one or more transactions upon SMP bus 20 as well as one or more coherency operations upon network 14.

[0047] Each SMP node 12 is essentially an SMP system having memory 22 as the shared memory. Processors 16 are high performance processors. In one embodiment, each processor 16 is a SPARC processor compliant with version 9 of the SPARC processor architecture. It is noted, however, that any processor architecture may be employed by processors 16.

[0048] Typically, processors 16 include internal instruction and data caches. Therefore, external caches 18 are labeled as L2 caches (for level 2, wherein the internal caches are level 1 caches). If processors 16 are not configured with internal caches, then external caches 18 are level 1 caches. It is noted that the "level" nomenclature is used to identify proximity of a particular cache to the processing core within processor 16. Level 1 is nearest the processing core, level 2 is next nearest, etc. External caches 18 provide rapid access to memory addresses frequently accessed by the processor 16 coupled thereto. It is noted that external caches 18 may be configured in any of a variety of specific cache arrangements. For example, set-associative or direct-mapped configurations may be employed by external caches 18.

[0049] SMP bus 20 accommodates communication between processors 16 (through caches 18), memory 22, system interface 24, and I/O interface 26. In one embodiment, SMP bus 20 includes an address bus and related control signals, as well as a data bus and related control signals. A split-transaction bus protocol may be employed upon SMP bus 20. Generally speaking, a split-transaction bus protocol is a protocol in which a transaction on the bus is implemented by several asynchronous phases. Transactions involving address and data include an address phase in which the address and related control information is conveyed upon the address bus, and a data phase in which the data is conveyed upon the data bus. Additional address phases and/or data phases for other transactions may be initiated prior to the data phase corresponding to a particular address phase. An address phase and the corresponding data phase may be correlated in a number of ways. For example, data transactions may occur in the same order that the address transactions occur. Alternatively, address and data phases of a transaction may be identified via a unique tag.
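
As an illustration of the tagged alternative only, the following C fragment models how a later data phase might be matched to an earlier address phase by a unique tag; the field names, tag width, and limit on outstanding transactions are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_OUTSTANDING 16          /* hypothetical limit on outstanding transactions */

    typedef struct {
        bool     pending;               /* address phase issued, data phase not yet seen */
        uint8_t  tag;                   /* unique tag carried by both phases             */
        uint64_t address;
    } outstanding_t;

    static outstanding_t outstanding[MAX_OUTSTANDING];

    /* Record an address phase under its tag. */
    static void address_phase(uint8_t tag, uint64_t address)
    {
        outstanding[tag % MAX_OUTSTANDING] =
            (outstanding_t){ .pending = true, .tag = tag, .address = address };
    }

    /* Match a later data phase back to its address phase by tag; returns
     * false if no matching address phase is outstanding.                    */
    static bool data_phase(uint8_t tag, uint64_t *address)
    {
        outstanding_t *t = &outstanding[tag % MAX_OUTSTANDING];
        if (!t->pending || t->tag != tag)
            return false;
        *address   = t->address;
        t->pending = false;
        return true;
    }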

[0050] Memory 22 is configured to store data and instruction code for use by processors 16. Memory 22 preferably comprises dynamic random access memory (DRAM), although any type of memory may be used. Memory 22, in conjunction with similar illustrated memories in the other SMP nodes 12, forms a distributed shared memory system. Each address in the address space of the distributed shared memory is assigned to a particular node, referred to as the home node of the address. A processor within a different node than the home node may access the data at an address of the home node, potentially caching the data. Therefore, coherency is maintained between SMP nodes 12 as well as among processors 16 and caches 18 within a particular SMP node 12A-12D. System interface 24 provides internode coherency, while snooping upon SMP bus 20 provides intranode coherency.

[0051] In addition to maintaining internode coherency, system interface 24 detects addresses upon SMP bus 20 which require a data transfer to or from another SMP node 12. System interface 24 performs the transfer, and provides the corresponding data for the transaction upon SMP bus 20. In the embodiment shown, system interface 24 is coupled to a point-to-point network 14. However, it is noted that in alternative embodiments other networks may be used. In a point-to-point network, individual connections exist between each node upon the network. A particular node communicates directly with a second node via a dedicated link. To communicate with a third node, the particular node utilizes a different link than the one used to communicate with the second node.

[0052] It is noted that, although four SMP nodes 12 are shown in FIG. 1, embodiments of computer system 10 employing any number of nodes are contemplated.

[0053] FIGS. 1A and 1B are conceptualized illustrations of distributed memory architectures supported by one embodiment of computer system 10. Specifically, FIGS. 1A and 1B illustrate alternative ways in which each SMP node 12 of FIG. 1 may cache data and perform memory accesses. Details regarding the manner in which computer system 10 supports such accesses will be described in further detail below.

[0054] Turning now to FIG. 1A, a logical diagram depicting a first memory architecture 30 supported by one embodiment of computer system 10 is shown. Architecture 30 includes multiple processors 32A-32D, multiple caches 34A-34D, multiple memories 36A-36D, and an interconnect network 38. The multiple memories 36 form a distributed shared memory. Each address within the address space corresponds to a location within one of memories 36.

[0055] Architecture 30 is a non-uniform memory architecture (NUMA). In a NUMA architecture, the amount of time required to access a first memory address may be substantially different than the amount of time required to access a second memory address. The access time depends upon the origin of the access and the location of the memory 36A-36D which stores the accessed data. For example, if processor 32A accesses a first memory address stored in memory 36A, the access time may be significantly shorter than the access time for an access to a second memory address stored in one of memories 36B-36D. That is, an access by processor 32A to memory 36A may be completed locally (e.g. without transfers upon network 38), while a processor 32A access to memory 36B is performed via network 38. Typically, an access through network 38 is slower than an access completed within a local memory. For example, a local access might be completed in a few hundred nanoseconds while an access via the network might occupy a few microseconds.

[0056] Data corresponding to addresses stored in remote nodes may be cached in any of the caches 34. However, once a cache 34 discards the data corresponding to such a remote address, a subsequent access to the remote address is completed via a transfer upon network 38.

[0057] NUMA architectures may provide excellent performance characteristics for software applications which use addresses that correspond primarily to a particular local memory. Software applications which exhibit more random access patterns and which do not confine their memory accesses to addresses within a particular local memory, on the other hand, may experience a large amount of network traffic as a particular processor 32 performs repeated accesses to remote nodes.

[0058] Turning now to FIG. 1B, a logical diagram depicting a second memory architecture 40 supported by the computer system 10 of FIG. 1 is shown. Architecture 40 includes multiple processors 42A-42D, multiple caches 44A-44D, multiple memories 46A-46D, and network 48. However, memories 46 are logically coupled between caches 44 and network 48. Memories 46 serve as larger caches (e.g. a level 3 cache), storing addresses which are accessed by the corresponding processors 42. Memories 46 are said to "attract" the data being operated upon by a corresponding processor 42. As opposed to the NUMA architecture shown in FIG. 1A, architecture 40 reduces the number of accesses upon the network 48 by storing remote data in the local memory when the local processor accesses that data. The remote data stored in local memory is referred to herein as shadow pages of the remote data.

[0059] Architecture 40 is referred to as a cache-only memory architecture (COMA). Multiple locations within the distributed shared memory formed by the combination of memories 46 may store data corresponding to a particular address. No permanent mapping of a particular address to a particular storage location is assigned. Instead, the location storing data corresponding to the particular address changes dynamically based upon the processors 42 which access that particular address. Conversely, in the NUMA architecture a particular storage location within memories 46 is assigned to a particular address. Architecture 40 adjusts to the memory access patterns performed by applications executing thereon, and coherency is maintained between the memories 46.

[0060] In a preferred embodiment, computer system 10 supports both of the memory architectures shown in FIGS. 1A and 1B. In particular, a memory address may be accessed in a NUMA fashion from one SMP node 12A-12D while being accessed in a COMA manner from another SMP node 12A-12D. In one embodiment, a NUMA access is detected if the node ID bits of the address upon SMP bus 20 identify another SMP node 12 as the home node of the address presented. Otherwise, a COMA access is presumed. Additional details will be provided below. In one embodiment, data accessed in a COMA manner is stored as a shadow page within the node accessing the data.

[0061] In one embodiment, the COMA architecture is implemented using a combination of hardware and software techniques. Hardware maintains coherency between the locally cached copies of pages, and software (e.g. the operating system employed in computer system 10) is responsible for deallocating and allocating cached pages.

[0062] FIG. 2 depicts details of one implementation of an SMP node 12A that generally conforms to the SMP node 12A shown in FIG. 1. Other nodes 12 may be configured similarly. It is noted that alternative specific implementations of each SMP node 12 of FIG. 1 are also possible. The implementation of SMP node 12A shown in FIG. 2 includes multiple subnodes such as subnodes 50A and 50B. Each subnode 50 includes two processors 16 and corresponding caches 18, a memory portion 56, an address controller 52, and a data controller 54. The memory portions 56 within subnodes 50 collectively form the memory 22 of the SMP node 12A of FIG. 1. Other subnodes (not shown) are further coupled to SMP bus 20 to form the I/O interfaces 26.

[0063] As shown in FIG. 2, SMP bus 20 includes an address bus 58 and a data bus 60. Address controller 52 is coupled to address bus 58, and data controller 54 is coupled to data bus 60. FIG. 2 also illustrates system interface 24, including a system interface logic block 62, a translation storage 64, a directory 66, and a memory tag (MTAG) 68. Logic block 62 is coupled to both address bus 58 and data bus 60, and asserts an ignore signal 70 upon address bus 58 under certain circumstances as will be explained further below. Additionally, logic block 62 is coupled to translation storage 64, directory 66, MTAG 68, and network 14.

[0064] For the embodiment of FIG. 2, each subnode 50 is configured upon a printed circuit board which may be inserted into a backplane upon which SMP bus 20 is situated. In this manner, the number of processors and/or I/O interfaces 26 included within an SMP node 12 may be varied by inserting or removing subnodes 50. For example, computer system 10 may initially be configured with a small number of subnodes 50. Additional subnodes 50 may be added from time to time as the computing power required by the users of computer system 10 grows.

[0065] Address controller 52 provides an interface between caches 18 and the address portion of SMP bus 20. In the embodiment shown, address controller 52 includes an out queue 72 and some number of in queues 74. Out queue 72 buffers transactions from the processors connected thereto until address controller 52 is granted access to address bus 58. Address controller 52 performs the transactions stored in out queue 72 in the order those transactions were placed into out queue 72 (i.e. out queue 72 is a FIFO queue). Transactions performed by address controller 52 as well as transactions received from address bus 58 which are to be snooped by caches 18 and caches internal to processors 16 are placed into in queue 74.

[0066] Similar to out queue 72, in queue 74 is a FIFO queue. All address transactions are stored in the in queue 74 of each subnode 50 (even within the in queue 74 of the subnode 50 which initiates the address transaction). Address transactions are thus presented to caches 18 and processors 16 for snooping in the order they occur upon address bus 58. The order that transactions occur upon address bus 58 is the order for SMP node 12A. However, the complete system is expected to have one global memory order. This ordering expectation creates a problem in both the NUMA and COMA architectures employed by computer system 10, since the global order may need to be established by the order of operations upon network 14. If two nodes perform a transaction to an address, the order that the corresponding coherency operations occur at the home node for the address defines the order of the two transactions as seen within each node. For example, if two write transactions are performed to the same address, then the second write operation to arrive at the address' home node should be the second write transaction to complete (i.e. a byte location which is updated by both write transactions stores a value provided by the second write transaction upon completion of both transactions). However, the node which performs the second transaction may actually have the second transaction occur first upon SMP bus 20. Ignore signal 70 allows the second transaction to be transferred to system interface 24 without any of the CPUs or I/O devices in the SMP node 12 reacting to the transaction.

[0067] Therefore, in order to operate effectively with the ordering constraints imposed by the out queue/in queue structure of address controller 52, system interface logic block 62 employs ignore signal 70. When a transaction is presented upon address bus 58 and system interface logic block 62 detects that a remote transaction is to be performed in response to the transaction, logic block 62 asserts the ignore signal 70. Assertion of the ignore signal 70 with respect to a transaction causes address controller 52 to inhibit storage of the transaction into in queues 74. Therefore, other transactions which may occur subsequent to the ignored transaction and which complete locally within SMP node 12A may complete out of order with respect to the ignored transaction without violating the ordering rules of in queue 74. In particular, transactions performed by system interface 24 in response to coherency activity upon network 14 may be performed and completed subsequent to the ignored transaction. When a response is received from the remote transaction, the ignored transaction may be reissued by system interface logic block 62 upon address bus 58. The transaction is thereby placed into in queue 74, and may complete in order with transactions occurring at the time of reissue.

[0068] It is noted that in one embodiment, once a transaction from a particular address controller 52 has been ignored, subsequent coherent transactions from that particular address controller 52 are also ignored. Transactions from a particular processor 16 may have an important ordering relationship with respect to each other, independent of the ordering requirements imposed by presentation upon address bus 58. For example, a transaction may be separated from another transaction by a memory synchronizing instruction such as the MEMBAR instruction included in the SPARC architecture. The processor 16 conveys the transactions in the order the transactions are to be performed with respect to each other.

[0069] Data controller 54 routes data to and from data bus 60, memory portion 56, and caches 18. Data controller 54 may include in and out queues similar to address controller 52. In one embodiment, data controller 54 employs multiple physical units in a byte-sliced bus configuration.

[0070] Processors 16 as shown in FIG. 2 include memory management units (MMUs) 76A-76B. MMUs 76 perform a virtual to physical address translation upon the data addresses generated by the instruction code executed upon processors 16, as well as the instruction addresses. The addresses generated in response to instruction execution are virtual addresses. In other words, the virtual addresses are the addresses created by the programmer of the instruction code. The virtual addresses are passed through an address translation mechanism (embodied in MMUs 76), from which corresponding physical addresses are created. The physical address identifies a storage location within memory 22.

[0071] Virtual to physical address translation is performed for many reasons. For example, the address translation mechanism may be used to grant or deny a particular computing task's access to certain memory addresses. In this manner, the data and instructions within one computing task are isolated from the data and instructions of another computing task. Additionally, portions of the data and instructions of a computing task may be "paged out" to a hard disk drive. When a portion is paged out, the translation is invalidated. Upon access to the portion by the computing task, an interrupt occurs due to the failed translation. The interrupt allows the operating system to retrieve the corresponding information from the hard disk drive. In this manner, more virtual memory may be available than actual memory in memory 22. Many other uses for virtual memory are well known.

[0072] Referring back to computer system 10 shown in FIG. 1 in conjunction with the SMP node 12A implementation illustrated in FIG. 2, the physical address computed by MMUs 76 may be a local physical address (LPA) defining a location within the memory 22 associated with the SMP node 12 in which the processor 16 is located. MTAG 68 stores a coherency state for each "coherency unit" in memory 22. When an address transaction is performed upon SMP bus 20, system interface logic block 62 examines the coherency state stored in MTAG 68 for the accessed coherency unit. If the coherency state indicates that the SMP node 12 has sufficient access rights to the coherency unit to perform the access, then the address transaction proceeds. If, however, the coherency state indicates that coherency operations should be performed prior to completion of the transaction, then system interface logic block 62 asserts the ignore signal 70. Logic block 62 performs coherency operations upon network 14 to acquire the appropriate coherency state. When the appropriate coherency state is acquired, logic block 62 reissues the ignored transaction upon SMP bus 20. Subsequently, the transaction completes.

[0073] Generally speaking, the coherency state maintained for a coherency unit at a particular storage location (e.g. a cache or a memory 22) indicates the access rights to the coherency unit at that SMP node 12. The access right indicates the validity of the coherency unit, as well as the read/write permission granted for the copy of the coherency unit within that SMP node 12. In one embodiment, the coherency states employed by computer system 10 are modified, owned, shared, and invalid. The modified state indicates that the SMP node 12 has updated the corresponding coherency unit. Therefore, other SMP nodes 12 do not have a copy of the coherency unit. Additionally, when the modified coherency unit is discarded by the SMP node 12, the coherency unit is stored back to the home node. The owned state indicates that the SMP node 12 is responsible for the coherency unit, but other SMP nodes 12 may have shared copies. Again, when the coherency unit is discarded by the SMP node 12, the coherency unit is stored back to the home node. The shared state indicates that the SMP node 12 may read the coherency unit but may not update the coherency unit without acquiring the owned state. Additionally, other SMP nodes 12 may have copies of the coherency unit as well. Finally, the invalid state indicates that the SMP node 12 does not have a copy of the coherency unit. In one embodiment, the modified state indicates write permission and any state but invalid indicates read permission to the corresponding coherency unit.
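
A minimal C sketch of the access-rights rule stated above (modified grants write permission; any state other than invalid grants read permission) might look as follows; the enumeration and function names are illustrative assumptions only.

    #include <stdbool.h>

    /* Coherency states maintained per coherency unit, e.g. in MTAG 68. */
    typedef enum { INVALID, SHARED, OWNED, MODIFIED } coh_state_t;

    /* Per paragraph [0073]: modified grants write permission; any state
     * other than invalid grants read permission.                        */
    static bool may_read(coh_state_t s)  { return s != INVALID;  }
    static bool may_write(coh_state_t s) { return s == MODIFIED; }

    /* Sketch of the check described in paragraph [0072]: if rights are
     * insufficient, the ignore signal is asserted, coherency operations
     * acquire the needed state over the network, and the transaction is
     * later reissued upon the SMP bus.                                  */
    static bool sufficient_rights(coh_state_t s, bool is_write)
    {
        return is_write ? may_write(s) : may_read(s);
    }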

[0074] As used herein, a coherency unit is a number of contiguous bytes of memory which are treated as a unit for coherency purposes. For example, if one byte within the coherency unit is updated, the entire coherency unit is considered to be updated. In one specific embodiment, the coherency unit is a cache line, comprising 64 contiguous bytes. It is understood, however, that a coherency unit may comprise any number of bytes.

[0075] System interface 24 also includes a translation mechanism which utilizes translation storage 64 to store translations from a local physical address (LPA) to a global address (GA), and from a GA back to an LPA. Certain bits within a physical address identify the home node for the address, at which coherency information is stored for that global address. For example, an embodiment of computer system 10 may employ four SMP nodes 12 such as that of FIG. 1. In such an embodiment, two bits of the physical address identify the home node. Preferably, bits from the most significant portion of the physical address are used to identify the home node. The same bits are used in the physical address to identify NUMA accesses. If the bits of the physical address indicate that the local node is not the home node, then the physical address is a global address and the transaction is performed in NUMA mode. Therefore, the operating system places global addresses in MMUs 76 for any NUMA-type pages. Conversely, the operating system places LPAs in MMUs 76 for any COMA-type pages. It is noted that an LPA may be the same as a GA (for NUMA accesses to remote addresses and accesses to addresses allocated to local memory). Alternatively, an LPA may be translated to a GA when the LPA identifies storage locations that store copies of data having a home in another SMP node 12, i.e. shadow pages.
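
The node-ID decoding can be illustrated with the following C fragment. The 40-bit address width and the exact bit positions are assumptions made purely for the sketch; the described embodiment only requires that the two most significant bits of the physical address identify one of four home nodes.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative layout only: a 40-bit physical address whose two most
     * significant bits identify the home node (four nodes, as in FIG. 1). */
    #define PA_BITS       40
    #define NODE_ID_BITS  2
    #define NODE_ID_SHIFT (PA_BITS - NODE_ID_BITS)

    static unsigned home_node(uint64_t pa)
    {
        return (unsigned)((pa >> NODE_ID_SHIFT) & ((1u << NODE_ID_BITS) - 1));
    }

    /* A physical address whose node ID names another node is a global
     * address, and the access proceeds in NUMA mode; otherwise the address
     * is treated as a local physical address (COMA-type pages use LPAs).  */
    static bool is_numa_access(uint64_t pa, unsigned local_node)
    {
        return home_node(pa) != local_node;
    }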

[0076] The directory 66 of a particular home node identifies which SMP nodes 12 have copies of data corresponding to a given physical address assigned to the home node such that coherency between the copies may be maintained. Additionally, the directory 66 of the home node identifies the SMP node 12 which owns the coherency unit. Therefore, while local coherency between caches 18 and processors 16 is maintained via snooping, system-wide (or global) coherency is maintained using MTAG 68 and directory 66. Directory 66 stores the coherency information corresponding to the coherency units which are assigned to SMP node 12A (i.e. for which SMP node 12A is the home node).

[0077] It is noted that for the embodiment of FIG. 2, directory 66 and MTAG 68 store information for each coherency unit (i.e., on a coherency unit basis). Conversely, translation storage 64 stores local physical to global address translations defined for pages. A page includes multiple coherency units, and is typically several kilobytes or even megabytes in size.

[0078] Computer system 10 accordingly creates local physical address to global address translations on a page basis (thereby allocating a local memory page for storing a copy of a remotely stored global page). Therefore, blocks of memory 22 are allocated to a particular global address on a page basis as well. However, as stated above, coherency states and coherency activities are performed upon a coherency unit. Therefore, when a page is allocated in memory to a particular global address, the data corresponding to the page is not necessarily transferred to the allocated memory. Instead, as processors 16 access various coherency units within the page, those coherency units are transferred from the owner of the coherency unit. In this manner, the data actually accessed by SMP node 12A is transferred into the corresponding memory 22. Data not accessed by SMP node 12A may not be transferred, thereby reducing overall bandwidth usage upon network 14 in comparison to embodiments which transfer the page of data upon allocation of the page in memory 22.
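
The following C sketch illustrates this page-allocation/per-unit-transfer split: a whole shadow page is allocated locally, but an individual coherency unit is fetched from its owner only on first access. The page and unit sizes and the fetch helper are illustrative assumptions, not taken from the embodiments described herein.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative sizes: an 8 KB page of 64-byte coherency units.       */
    #define PAGE_SIZE      8192u
    #define COH_UNIT_SIZE  64u
    #define UNITS_PER_PAGE (PAGE_SIZE / COH_UNIT_SIZE)

    typedef struct {
        uint64_t global_page;                  /* global address of the remote page */
        bool     unit_present[UNITS_PER_PAGE]; /* MTAG-like per-unit validity       */
        uint8_t  data[PAGE_SIZE];              /* locally allocated shadow page     */
    } shadow_page_t;

    /* Hypothetical placeholder for fetching one coherency unit from its owner. */
    static void fetch_unit_from_owner(uint64_t ga, uint8_t *dst) { (void)ga; (void)dst; }

    /* Only the touched coherency unit is transferred; untouched units consume
     * no network bandwidth even though the whole page was allocated locally.  */
    static uint8_t *access_shadow(shadow_page_t *pg, size_t offset)
    {
        size_t unit = offset / COH_UNIT_SIZE;
        if (!pg->unit_present[unit]) {
            fetch_unit_from_owner(pg->global_page + unit * COH_UNIT_SIZE,
                                  &pg->data[unit * COH_UNIT_SIZE]);
            pg->unit_present[unit] = true;
        }
        return &pg->data[offset];
    }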

[0079] It is noted that in one embodiment, translation storage 64, directory 66, and/or MTAG 68 may be caches which store only a portion of the associated translation, directory, and MTAG information, respectively. The entirety of the translation, directory, and MTAG information may be stored in tables within memory 22 or a dedicated memory storage (not shown). If required information for an access is not found in the corresponding cache, the tables are accessed by system interface 24.

[0080] Turning now to FIG. 2A, an exemplary directory entry 71 is shown. Directory entry 71 may be employed by one embodiment of directory 66 shown in FIG. 2. Other embodiments of directory 66 may employ dissimilar directory entries. Directory entry 71 includes a valid bit 73, a write back bit 75, an owner field 77, and a sharers field 79. Directory entry 71 resides within the table of directory entries, and is located within the table via the global address identifying the corresponding coherency unit. More particularly, the directory entry 71 associated with a coherency unit is stored within the table of directory entries at an offset formed from the global address which identifies the coherency unit.

[0081] Valid bit 73 indicates, when set, that directory entry 71 is valid (i.e. that directory entry 71 is storing coherency information for a corresponding coherency unit). When clear, valid bit 73 indicates that directory entry 71 is invalid.

[0082] Owner field 77 identifies one of SMP nodes 12 as the owner of the coherency unit. The owning SMP node 12A-12D maintains the coherency unit in either the modified or owned states. Typically, the owning SMP node 12A-12D acquires the coherency unit in the modified state. Subsequently, the owning SMP node 12A-12D may then transition to the owned state upon providing a copy of the coherency unit to another SMP node 12A-12D. The other SMP node 12A-12D acquires the coherency unit in the shared state. In one embodiment, owner field 77 comprises two bits encoded to identify one of four SMP nodes 12A-12D as the owner of the coherency unit.

[0083] Sharers field 79 includes one bit assigned to each SMP node 12A-12D. If an SMP node 12A-12D is maintaining a shared copy of the coherency unit, the corresponding bit within sharers field 79 is set. Conversely, if the SMP node 12A-12D is not maintaining a shared copy of the coherency unit, the corresponding bit within sharers field 79 is clear. In this manner, sharers field 79 indicates all of the shared copies of the coherency unit which exist within the computer system 10 of FIG. 1.

[0084] Write back bit 75 indicates, when set, that the SMP node 12A-12D identified as the owner of the coherency unit via owner field 77 has written the updated copy of the coherency unit to the home SMP node 12. When clear, bit 75 indicates that the owning SMP node 12A-12D has not written the updated copy of the coherency unit to the home SMP node 12A-12D.
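
Taken together, the fields of directory entry 71 might be modeled in C as the following bit-field structure. The packing shown is illustrative only; the description above specifies the fields (a valid bit, a write back bit, a two-bit owner field, and one sharer bit per node) but not a particular layout.

    #include <stdbool.h>

    #define NUM_NODES 4

    /* One directory entry per coherency unit, following the fields described
     * for directory entry 71 (the bit packing here is an assumption).        */
    typedef struct {
        unsigned valid      : 1;          /* entry holds coherency information           */
        unsigned write_back : 1;          /* owner has written the unit back to the home */
        unsigned owner      : 2;          /* one of four nodes owns the unit             */
        unsigned sharers    : NUM_NODES;  /* one bit per node holding a shared copy      */
    } dir_entry_t;

    static bool node_has_shared_copy(const dir_entry_t *e, unsigned node)
    {
        return (e->sharers >> node) & 1u;
    }

    static void add_sharer(dir_entry_t *e, unsigned node)
    {
        e->sharers |= (1u << node);
    }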

[0085] Turning now to FIG. 3, a block diagram of one embodiment of system interface 24 is shown. As shown in FIG. 3, system interface 24 includes directory 66, translation storage 64, and MTAG 68. Translation storage 64 is shown as a global address to local physical address (GA2LPA) translation unit 80 and a local physical address to global address (LPA2GA) translation unit 82.

[0086] System interface 24 also includes input and output queues for storing transactions to be performed upon SMP bus 20 or network 14. Specifically, for the embodiment shown, system interface 24 includes input header queue 84 and output header queue 86 for buffering header packets to and from network 14. Header packets identify an operation to be performed, and specify the number and format of any data packets which may follow. Output header queue 86 buffers header packets to be transmitted upon network 14, and input header queue 84 buffers header packets received from network 14 until system interface 24 processes the received header packets. Similarly, data packets are buffered in input data queue 88 and output data queue 90 until the data may be transferred upon SMP data bus 60 and network 14, respectively.

[0087] SMP out queue 92, SMP in queue 94, and SMP I/O in queue (PIQ) 96 are used to buffer address transactions to and from address bus 58. SMP out queue 92 buffers transactions to be presented by system interface 24 upon address bus 58. Reissue transactions queued in response to the completion of coherency activity with respect to an ignored transaction are buffered in SMP out queue 92. Additionally, transactions generated in response to coherency activity received from network 14 are buffered in SMP out queue 92. SMP in queue 94 stores coherency related transactions to be serviced by system interface 24. Conversely, SMP PIQ 96 stores I/O transactions to be conveyed to an I/O interface residing in another SMP node 12. I/O transactions generally are considered non-coherent and therefore do not generate coherency activities.

[0088] SMP in queue 94 and SMP PIQ 96 receive transactions to be queued from a transaction filter 98. Transaction filter 98 is coupled to MTAG 68 and SMP address bus 58. If transaction filter 98 detects an I/O transaction upon address bus 58 which identifies an I/O interface upon another SMP node 12, transaction filter 98 places the transaction into SMP PIQ 96. If a coherent transaction to an LPA address is detected by transaction filter 98, then the corresponding coherency state from MTAG 68 is examined. In accordance with the coherency state, transaction filter 98 may assert ignore signal 70 and may queue a coherency transaction in SMP in queue 94. Ignore signal 70 is asserted and a coherency transaction queued if MTAG 68 indicates that insufficient access rights to the coherency unit for performing the coherent transaction are maintained by SMP node 12A. Conversely, ignore signal 70 is deasserted and a coherency transaction is not generated if MTAG 68 indicates that a sufficient access right is maintained by SMP node 12A.

[0089] Transactions from SMP in queue 94 are processed by a request agent 100 within system interface 24. Prior to action by request agent 100, LPA2GA translation unit 82 translates the address of the transaction (if it is an LPA address) from the local physical address presented upon SMP address bus 58 into the corresponding global address. Request agent 100 then generates a header packet specifying a particular coherency request to be transmitted to the home node identified by the global address. The coherency request is placed into output header queue 86. Subsequently, a coherency reply is received into input header queue 84. Request agent 100 processes the coherency replies from input header queue 84, potentially generating reissue transactions for SMP out queue 92 (as described below).
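
As a rough C sketch of this request-agent sequence only: the packet layout, request types, and helper functions below are hypothetical placeholders standing in for LPA2GA translation unit 82, the node-ID decoding, and output header queue 86; no actual packet format is described here.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical header packet for a coherency request. */
    typedef struct {
        unsigned dest_node;     /* home node identified by the global address */
        unsigned request_type;  /* e.g. read-to-share, read-to-own            */
        uint64_t global_addr;
    } header_packet_t;

    /* Placeholders for LPA2GA translation, node-ID extraction, and the
     * output header queue; all are assumptions for this sketch.           */
    static uint64_t lpa2ga(uint64_t lpa)                     { return lpa; }
    static unsigned home_node_of(uint64_t ga)                { return (unsigned)(ga >> 38) & 3u; }
    static void     enqueue_output_header(header_packet_t p) { (void)p; }

    /* Request agent sketch: translate the address if it is an LPA, then send
     * a coherency request to the home node; the coherency reply later arrives
     * via the input header queue and may cause a reissue transaction.        */
    static void issue_coherency_request(uint64_t addr, bool is_lpa, unsigned request_type)
    {
        uint64_t ga = is_lpa ? lpa2ga(addr) : addr;
        header_packet_t pkt = {
            .dest_node    = home_node_of(ga),
            .request_type = request_type,
            .global_addr  = ga,
        };
        enqueue_output_header(pkt);
    }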

[0090] Also included in system interface 24 are a home agent 102 and a slave agent 104. Home agent 102 processes coherency requests received from input header queue 84. From the coherency information stored in directory 66 with respect to a particular global address, home agent 102 determines if a coherency demand is to be transmitted to one or more slave agents in other SMP nodes 12. In one embodiment, home agent 102 blocks the coherency information corresponding to the affected coherency unit. In other words, subsequent requests involving the coherency unit are not performed until the coherency activity corresponding to the coherency request is completed. According to one embodiment, home agent 102 receives a coherency completion from the request agent which initiated the coherency request (via input header queue 84). The coherency completion indicates that the coherency activity has completed. Upon receipt of the coherency completion, home agent 102 removes the block upon the coherency information corresponding to the affected coherency unit. It is noted that, since the coherency information is blocked until completion of the coherency activity, home agent 102 may update the coherency information in accordance with the coherency activity performed immediately when the coherency request is received.

[0091] Slave agent 104 receives coherency demands from home agents of other SMP nodes 12 via input header queue 84. In response to a particular coherency demand, slave agent 104 may queue a coherency transaction in SMP out queue 92. In one embodiment, the coherency transaction may cause caches 18 and caches internal to processors 16 to invalidate the affected coherency unit. Alternatively, the coherency transaction may cause caches 18 and caches internal to processors 16 to change the coherency state of the coherency unit to shared. Prior to performing activities in response to a coherency demand, the global address received with the coherency demand is translated to a local physical address via GA2LPA translation unit 80. Once slave agent 104 has completed activity in response to a coherency demand, slave agent 104 transmits a coherency reply to the request agent which initiated the coherency request corresponding to the coherency demand.

[0092] According to one embodiment, the coherency protocol enforced by request agents 100, home agents 102, and slave agents 104 includes a write invalidate policy. In other words, when a processor 16 within an SMP node 12 updates a coherency unit, any copies of the coherency unit stored within other SMP nodes 12 are invalidated. However, other write policies may be used in other embodiments. For example, a write update policy may be employed. According to a write update policy, when a coherency unit is updated the updated data is transmitted to each of the copies of the coherency unit stored in each of the SMP nodes 12.
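
The difference between the two policies can be sketched in C as follows, using a per-node sharer mask such as the one held in sharers field 79; the messaging helpers are hypothetical placeholders.

    #include <stddef.h>
    #include <stdint.h>

    #define NUM_NODES 4

    /* Hypothetical messaging placeholders. */
    static void send_invalidate(unsigned node, uint64_t ga)               { (void)node; (void)ga; }
    static void send_update(unsigned node, uint64_t ga, const void *data) { (void)node; (void)ga; (void)data; }

    /* Write-invalidate policy: every remote copy is invalidated on a write. */
    static void on_write_invalidate(uint64_t ga, unsigned writer, unsigned sharer_mask)
    {
        for (unsigned n = 0; n < NUM_NODES; n++)
            if (n != writer && (sharer_mask & (1u << n)))
                send_invalidate(n, ga);
    }

    /* Write-update policy: the updated data is pushed to every remote copy. */
    static void on_write_update(uint64_t ga, unsigned writer, unsigned sharer_mask,
                                const void *data)
    {
        for (unsigned n = 0; n < NUM_NODES; n++)
            if (n != writer && (sharer_mask & (1u << n)))
                send_update(n, ga, data);
    }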

[0093] Referring back to FIG. 2, the verification and acquisition of coherency rights according to one embodiment of the present invention are discussed below. When processor 16 attempts to read or write to a memory location, the MMU within processor 16 converts the virtual address generated by the program to a physical address. The physical address includes a node ID field which indicates the home node to which the physical address is assigned. If the home node corresponds to the node which initiates the transaction (i.e. the requesting node), the address is referred to as a local physical address. Alternatively, if the node ID field identifies a node other than the requesting node, the address is referred to as a global address. Using the physical address, processor 16 determines whether the data that corresponds to the physical address is stored in cache 18. Cache 18 may store data corresponding to local physical addresses or data corresponding to global addresses (data accessed in a NUMA manner may be stored in cache with a global address). The data corresponding to local physical addresses may be one of two types. The local physical address may correspond to memory locations assigned to the local node or it may correspond to shadow copies of data from a remote node (i.e. data accessed in a COMA manner).

[0094] If the data is found in cache 18, processor 16 accesses the data from the cache. Alternatively, if the data is not located in cache 18, then a request for the data is output on SMP bus 20. If the physical address is a global address, system interface 24 will initiate a global transaction to acquire the desired data. Alternatively, if the physical address is a local physical address, system interface logic 62 will determine whether the node has sufficient access rights to perform the transaction by reading the entry of MTAG 68 that corresponds to the address. If the node has sufficient access rights for the desired transaction, the transaction is performed on the data in memory 22. If the node does not have sufficient access rights, the node must acquire sufficient access rights before performing the transaction. The node obtains the access rights by initiating a coherency operation to obtain those rights.

[0095] In one embodiment, each node includes two logical address spaces. Both logical address spaces are mapped to the entire memory 22 and are synonyms for accessing the same memory location. A first address space, called CMR space, stores shadow copies of data from other nodes. The remaining data is stored in a second address space, called local address space.

[0096] In one embodiment, a local physical address includes an addressbit, called a CMR bit, that indicates whether the local physical addresscorresponds to an address assigned to the requesting node (i.e., therequesting node is the home node for the data) or to a shadow pagewithin the CMR address space (i.e., a shadow copy of data from a remotenode). If the CMR bit is set, which indicates the data is a shadow page,system interface 24 translates the local physical address to a globaladdress prior to performing a coherency operation. Alternatively, if theCMR bit is clear, which indicates the requesting node is the home nodefor the data, the local physical address is the same as the globaladdress and no translation is necessary prior to performing a coherencyoperation. Addresses with the CMR bit set are mapped to CMR space.Addresses with the CMR bit cleared are mapped to local address space.

[0097] Without the CMR bit, system interface logic 24 cannotdifferentiate between a local physical address that corresponds to localdata and a local physical address that corresponds to a shadow copy ofremote data. Accordingly, system interface 24 will translate all localphysical addresses prior to performing a coherency operation. Becausethe translation is unnecessary for local physical addresses thatcorrespond to local data, the translation adds unnecessary latency tothe transaction and increases the bandwidth that translation storage 64must handle. A protocol for acquiring sufficient access rights isdiscussed in more detail in copending, commonly assigned patentapplication (A Multiprocessing Computer System Employing Local andGlobal Address Spaces And Multiple Access Modes), filed Jul. 1, 1996,Ser. No. 08/675,635, which is herein incorporated by reference in itsentirety.

[0098] Turning now to FIG. 4, a diagram illustrating a physical addressspace and a logical address space of a four node multiprocessingcomputer system according to one embodiment of the present invention isshown. Physical address space 502 is divided among the four nodes of themultiprocessing computer system. Local physical address space 506 is aportion of physical address space 502 allocated to node 0. Localphysical address space 508 is a portion of physical address space 502allocated to node 1. Local physical address space 510 is a portion ofphysical address space 502 allocated to node 2. Local physical addressspace 512 is a portion of physical address space 502 allocated to node3. Each node is the home node for the portion of physical address space502 allocated to the node. In one embodiment, the local physical addressspace allocated to a node is physically located within the node.Accordingly, accesses by a home node to the local physical address spaceallocated to that node are local accesses. For example, if node 0accesses data stored at an address within local physical address space506, the data access is a local transaction. Other nodes, however, mayhave copies of the data, which may necessitate a global coherencyoperation prior to accessing the data.

[0099] In the illustrated embodiment, physical address space 502 is divided equally among the four nodes. In other embodiments, physical address space 502 may be divided unequally among the nodes. It is noted that the four node multiprocessor computer system of FIG. 4 is only an illustration. Multiprocessing computer systems with any number of nodes are contemplated.

[0100] The logical address space 504 of the computer system is alsoillustrated in FIG. 4. In the illustrated embodiment, two logicaladdress spaces are mapped to each local physical address space within anode. For example, local address space 514 and CMR address space 516 areboth mapped to local physical address space 506. In other words anaccess to an offset within local address space 514 accesses the samephysical memory location as an access to CMR space 516 with the sameoffset.

[0101] Local address space 518 and CMR address space 520 are mapped tolocal physical address space 508 of node 1. In a similar manner, localaddress space 522 and CMR address space 524 are mapped to local physicaladdress space 510 of node 2. Lastly, local address space 526 and CMRaddress space 528 are mapped to local physical address space 512 in node3.

[0102] Turning now to FIG. 5, a format of an address 601 of a multiprocessing system according to one embodiment of the present invention is shown. In the illustrated embodiment, address 601 includes four fields: an offset field 602, a CMR bit 604, a node ID field 606 and a coherency field 608. Offset field 602 identifies a page within an address space and an offset within the page. In the illustrated embodiment, offset field 602 is 37 bits. In one embodiment, the upper four bits of offset field 602 are reserved. As discussed above, CMR bit 604 identifies a logical address space within a node. In one embodiment, the CMR bit identifies either a local address space or a CMR space. In one particular embodiment, the local address space and the CMR space are mapped to the same physical address space. Accordingly, a memory controller of a processor ignores the CMR bit. System interface 24, however, uses the CMR bit to determine whether an address translation is necessary prior to a global transaction, such as a coherency operation. Node ID field 606 identifies the home node of the address. In the illustrated embodiment, node ID field 606 is four bits. Accordingly, a system employing the illustrated address format can accommodate 16 nodes. If node ID field 606 identifies the requesting node, the address is a local physical address and accesses to the memory location are local. Alternatively, if node ID field 606 identifies a remote node, the address is a global address and accesses to the memory location are global. Coherency field 608 indicates whether the address is in a coherent memory address space or a non-coherent address space. The non-coherent memory address space stores data that is not cached, such as I/O data. In one embodiment, the non-coherent address space occupies half the address space of the multiprocessing computer system. In one particular embodiment, the non-coherent address space occupies the most significant half of the system address space.
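
By way of illustration, the C sketch below shows one way the FIG. 5 fields might be unpacked from a 64-bit address word. Only the field widths quoted above come from the description; the bit positions, the single-bit coherency field, and all function names are assumptions added here.

```c
/* Illustrative extraction of the FIG. 5 fields from an address packed into a
 * 64-bit word. Only the field widths (37-bit offset, 1-bit CMR, 4-bit node ID)
 * come from the text; bit positions and the 1-bit coherency field are assumed. */
#include <stdint.h>

#define OFFSET_BITS    37u
#define CMR_SHIFT      37u
#define NODE_ID_SHIFT  38u
#define NODE_ID_BITS   4u
#define COHERENT_SHIFT 42u   /* assumed: coherency field treated as a single bit */

static inline uint64_t addr_offset(uint64_t a)   { return a & ((1ull << OFFSET_BITS) - 1); }
static inline unsigned addr_cmr(uint64_t a)      { return (unsigned)((a >> CMR_SHIFT) & 1u); }
static inline unsigned addr_node_id(uint64_t a)  { return (unsigned)((a >> NODE_ID_SHIFT) & ((1u << NODE_ID_BITS) - 1)); }
static inline unsigned addr_coherent(uint64_t a) { return (unsigned)((a >> COHERENT_SHIFT) & 1u); }

/* A local physical address names the requesting node; a set CMR bit marks a
 * shadow page whose local physical address must be translated to a global
 * address before a coherency operation. */
static inline int is_local(uint64_t a, unsigned my_node) { return addr_node_id(a) == my_node; }
static inline int needs_lpa2ga(uint64_t a)               { return addr_cmr(a) != 0; }
```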

[0103] Turning now to FIG. 6, an alternative format for a directoryentry 702 according to one embodiment of the present invention is shown.Valid field 73, write back field 75, owner field 77 and sharer field 79are similar to those discussed above in reference to FIG. 2A. Directoryentry 702 may be employed in one embodiment of directory 66. Otherembodiments of directory 66 may employ dissimilar directory entries.Directory entry 702 includes a COMA access (CA) bit 704. When set, theCOMA access bit indicates that a COMA access has been made to thecoherency unit that corresponds to the directory entry. Alternatively,when the COMA access bit is clear, it indicates that only NUMA accesseshave been made to the coherency unit that corresponds to the directoryentry.

[0104] If only NUMA accesses have been made to a particular coherencyunit, a translation from a global address to a local physical address isnot required when a reply is made to a coherency operation.Alternatively, if a COMA access to a coherency unit has been made,shadow copies of data may be stored, in one or more nodes, at a localphysical address which is a translation of the global address.Accordingly, when a demand is made to a coherency operation that hasbeen accessed in COMA mode, the slave node typically must translate theglobal address to a local physical address. In one embodiment, a bitwithin a demand to a coherency operation indicates whether the COMAaccess bit within the directory entry is asserted. Based upon the stateof this bit, the node that receives the reply may determine whether atranslation of the global address is required. In an alternativeembodiment, a control signal can be asserted which indicates whether theCOMA access bit of a directory entry is asserted.

[0105] For example, a node may request read-to-own (RTO) access rightsto a coherency unit. In response to the RTO request, the home node mayinvalidate any copies of the data within the coherency unit in othernodes. In one embodiment, a bit within the invalidate demand indicateswhether any nodes are storing the data within the coherency unit in COMAmode. If data is stored in COMA mode, the global address of theinvalidate demand is translated to a local physical address (if data isstored in NUMA mode on that node, the translation may be a unitytranslation) and the data corresponding to the translated local physicaladdress is invalidated. If data is only stored in NUMA mode in thesystem, a special invalidate command that indicates that no translationis required may be sent to the nodes. In this manner, the latencyassociated with the translation from the global address to localphysical address may be eliminated.

[0106] In an alternative embodiment, directory 66 stores information indicative of which nodes are storing data in COMA mode and which nodes are storing data in NUMA mode. In this manner, invalidate commands that require translation may be sent to the nodes storing data in COMA mode, and invalidate commands that specify no translation may be sent to nodes storing data in NUMA mode.

[0107] It is noted that the COMA access bit of a directory entry may be asserted when no COMA data is stored in any of the nodes of the multiprocessing system. For example, a COMA access may be made to data within a particular coherency unit. The COMA access causes the COMA access bit of the directory entry corresponding to the coherency unit to be asserted. Subsequently, the COMA access data is discarded or invalidated by the node storing the COMA access data. In this case, the COMA access bit may still be asserted and translations from a global address to a local physical address may be unnecessarily performed during a coherency operation. In an alternative embodiment, the COMA access bit may be promptly reset when all COMA data within the multiprocessing computer system has been invalidated. Another example of unnecessary GA2LPA lookups is a system in which one node stores a coherency unit in COMA mode and the other nodes store it in NUMA mode. This scheme causes a GA2LPA lookup in all nodes, even though only the node in COMA mode requires the lookup.

[0108] Turning now to FIG. 7, a diagram illustrating a free memory list802 and a CMR list 804 is shown. As discussed above, in one embodiment,two logical address spaces (local address space and CMR space) aremapped to the local physical address space of a node. In one particularembodiment, a list of free memory space 802 is maintained for each node.Free memory list 802 contains addresses of pages within the local memorythat have not been allocated for data storage. When a processor needsdata space in local memory, the processor stores the data to a pagelisted in the free memory list 802 and removes the address of the page,or pages, to which the data is stored from free memory list 802.

[0109] In one embodiment, a portion of the free memory of a node isallocated as free CMR space. CMR list 804 stores the addresses of pagesof unallocated memory designated as CMR space. When the system needs tostore data to CMR space, the system stores the data to a page within CMRlist 804 and removes the address of the page from CMR list 804. Thesystem allocates CMR space by moving addresses of pages from free list802 to CMR list 804. As illustrated by reference numeral 806, an addressof a page in free memory list 802 may be moved to CMR list 804 toallocate a page of local memory as CMR address space.

[0110] Turning now to FIG. 8, an organization of a local physicaladdress space and a local physical address to global address (LPA2GA)translation table is shown. In some embodiments, the LPA2GA translationtable of a node includes an entry for each page within the localphysical address space of that node. As the size of the local physicaladdress space increases, the size of the LPA2GA translation table alsoincreases. As the size of the LPA2GA table increases, the access time ofthe table also increases. As the access time increases, it becomesimpracticable to access the entire LPA2GA table. One alternative is toimplement the LPA2GA translation table as a cache backed by memory. Themost recently accessed translations may be stored in cache and theentire LPA2GA table stored in memory. This decreases the access time ofthe translation table if a translation is in the cache. However, cachemisses are fairly costly in terms of latency. Additionally, thecomplexity of the LPA2GA translation table is substantially increased.

[0111] In an alternative embodiment illustrated in FIG. 8, several pages of local physical address space are mapped to one entry of LPA2GA translation table 104. In the illustrated embodiment, four pages of local physical address space are mapped to one entry of LPA2GA translation table 104. For example, in the illustrated embodiment, page 802, page 808, page 814 and page 820 of local physical address space 506 are mapped to entry 826 of LPA2GA translation table 104. Prior to allocating a page of local memory as a CMR page, the node verifies that the LPA2GA translation table entry that corresponds to that page is available. If the entry is not available, the page is not allocated as CMR space and a different page is selected from the free list. In a different embodiment, as discussed above in reference to FIG. 7, pages are allocated as CMR space by moving page addresses from free memory list 802 to CMR list 804. Prior to moving a page address from free list 802 to CMR list 804, it is verified that the LPA2GA translation table entry that corresponds to that page is available.

[0112] In the illustrated embodiment, only one of the four pages that map to an entry of LPA2GA translation table 104 may be allocated as CMR space. For example, assume that page 802 is allocated as CMR space. If the node then attempts to allocate page 808 as CMR space, the node will find that entry 826 currently stores a translation for page 802. Accordingly, the node will not allocate page 808 as CMR space and will choose another page, such as page 804, to allocate as CMR space. If entry 828 is available (i.e., pages 810, 816 and 822 are not allocated as CMR space), then page 804 will be allocated as CMR space and entry 828 is marked as unavailable. Subsequently, pages 810, 816 and 822 cannot be allocated as CMR space. A translation may be stored to entry 828 at a later time (e.g., when a shadow page is stored to the page).
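
As a rough sketch of the page-to-entry mapping described above, the following C fragment assumes the entry index is taken from the low-order bits of the page number (as FIG. 9 does for the eight-page case) and that availability is tested through a simple valid flag; the 18-bit index width and the accessor name are hypothetical.

```c
/* Mapping several local pages onto one LPA2GA entry (FIG. 8). The index width
 * and the table accessor below are assumptions for illustration only. */
#include <stdint.h>
#include <stdbool.h>

#define LPA2GA_INDEX_BITS 18u   /* assumed: four pages per entry for a 20-bit page number */

extern bool lpa2ga_entry_valid(uint32_t index);   /* hypothetical: tests the entry's valid bit */

static inline uint32_t lpa2ga_index(uint32_t lpa_page)
{
    return lpa_page & ((1u << LPA2GA_INDEX_BITS) - 1);
}

/* A page may be allocated as CMR space only while the entry it maps to is free;
 * once any one of the mapped pages claims the entry, the others stay local-only. */
static inline bool page_may_become_cmr(uint32_t lpa_page)
{
    return !lpa2ga_entry_valid(lpa2ga_index(lpa_page));
}
```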

[0113] In other embodiments, more or fewer pages may be mapped to an entry of LPA2GA translation table 104. For example, eight pages of local physical address space may be mapped to one entry of LPA2GA translation table 104.

[0114] In the above manner, the size of LPA2GA translation table 104 may be reduced. For example, if four pages are mapped to each entry, the size of LPA2GA translation table 104 is one quarter the size of a conventional LPA2GA translation table. By reducing the size of LPA2GA table 104, the entire LPA2GA table may be maintained in a fast memory (e.g., an SRAM look-up table) without the need for a cache. The circuitry of the LPA2GA table is also reduced and the latency associated with a cache miss is eliminated.

[0115] In one embodiment, the allocation of memory space as CMR space is performed by software (e.g., the operating system of the node). In one particular embodiment, software verifies that the entry of LPA2GA translation table 104 that corresponds to a page is available prior to allocating that page as CMR space. A valid bit within the translation table entries may be used to indicate that an entry is available or unavailable.

[0116] The above described system limits the amount of memory that maybe allocated as CMR space. For example, if four pages are mapped to eachLPA2GA translation table entry, a maximum of 25% of local memory may beallocated as CMR space. Further, 100% utilization of the maximumavailable CMR space is unlikely. It is reasonable to assume that atleast 75% of the maximum available space may be utilized, which istypically sufficient for CMR space.

[0117] Turning now to FIG. 9, a translation of a local physical addressto a global address according to one embodiment of the present inventionis shown. In the illustrated embodiment, eight pages of local memory aremapped to each entry in LPA2GA translation table 104. As discussed abovein reference to FIG. 8, mapping multiple pages of local memory to oneLPA2GA table entry reduces the size of the LPA2GA table. In theillustrated embodiment, the LPA2GA table has 128k entries for 1M pagesof local physical memory.

[0118] LPA address 901 is substantially the same as the addressdiscussed above in reference to FIG. 5. LPA address 901 includescoherent field 608, node ID field 606, CMR bit 604 and offset field 602.In the illustrated embodiment, offset field 602 is divided into a pageoffset field 903 and an LPA page field 904. LPA page field 904identifies a page of the local memory assigned to the node identified bynode ID field 606. In one embodiment, the most significant four bits ofLPA page field 904 are reserved. Accordingly, LPA page field 904, whichis 24 bits including the reserved bits, may address up to 1M pages pernode. Page offset field 903 identifies a byte, or word, within a page.In the illustrated embodiment, page offset field 903 is 13 bits and eachpage is accordingly 8k bytes (or 8k words).

[0119] LPA2GA table 104 is addressed by the 17 least significant bits of LPA page field 904. It is noted that in other embodiments, LPA2GA table 104 may be addressed by more or fewer bits. For example, if four pages of physical memory were mapped to each LPA2GA entry, the LPA2GA table may be addressed by 18 bits of LPA page field 904.

[0120] The format of LPA2GA table entry 915 according to one embodiment is illustrated in FIG. 10. LPA2GA table entry 915 includes a reserve field 916, a valid bit 918, a node ID field 920, an LPA page field 922 and a parity field 924. In other embodiments, an LPA2GA table entry may include additional fields or may omit fields included in table entry 915. Additionally, the fields may include more or fewer bits than the fields illustrated in FIG. 10. In the illustrated embodiment, reserve field 916 includes five reserve bits. Valid bit 918 indicates whether the corresponding table entry stores valid translation data. If the valid bit is clear, the table entry does not contain a valid translation and is available to store a translation. Node ID field 920 identifies the home node within the multiprocessing system that corresponds to the address. In the illustrated embodiment, node ID field 920 is four bits. Accordingly, sixteen nodes may be accommodated. LPA page field 922 identifies a page within the home node identified by node ID field 920. In the illustrated embodiment, LPA page field 922 is 24 bits. Accordingly, 4M pages may be accommodated. Parity field 924 stores two parity bits to verify the accuracy of the table entry. In one embodiment, the parity bits are checked by hardware each time hardware accesses a table entry, but are not checked on software accesses.

[0121] Referring back to FIG. 9, the least significant 17 bits of LPA page field 904 of local physical address 901 are used to address LPA2GA translation table 104. In the illustrated embodiment, no address tag is stored in table entry 915, even though multiple pages correspond to a table entry. As discussed above in reference to FIG. 8, only one of the pages that correspond to table entry 915 may be allocated as CMR space. Accordingly, only one translation is stored in each entry of LPA2GA table 104 and no comparisons of address tags are required.

[0122] Global address 902 includes fields substantially similar to LPA address 901. Global address 902 includes a coherent field 906, a node ID field 908, a reserve bit 910, an LPA page field 912, and a page offset field 914. Portions of global address 902 are taken directly from LPA address 901 and other portions are obtained from fields within the LPA2GA entry addressed by LPA address 901. In the illustrated example, page offset field 914 is taken directly from page offset field 903 of LPA address 901. Node ID field 920 and LPA page field 922 of the table entry addressed by LPA address 901 provide the data for node ID field 908 and LPA page field 912 of global address 902.
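
The translation just described might be sketched as follows in C; the structure layout, the eight-pages-per-entry indexing, and the names are illustrative assumptions, not the disclosed hardware.

```c
/* A minimal sketch of the FIG. 9/FIG. 10 LPA-to-GA translation, assuming eight
 * pages per entry (17-bit index, 128K entries) and the entry fields of FIG. 10. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;     /* set when the page is allocated as CMR space */
    uint8_t  node_id;   /* 4-bit home node of the global address       */
    uint32_t ga_page;   /* 24-bit page number within the home node     */
} lpa2ga_entry_t;

#define LPA2GA_ENTRIES (1u << 17)
static lpa2ga_entry_t lpa2ga[LPA2GA_ENTRIES];

/* lpa_page: 24-bit LPA page field 904; page_offset: 13-bit page offset 903. */
bool lpa_to_ga(uint32_t lpa_page, uint32_t page_offset,
               uint8_t *ga_node, uint32_t *ga_page, uint32_t *ga_offset)
{
    const lpa2ga_entry_t *e = &lpa2ga[lpa_page & (LPA2GA_ENTRIES - 1)];
    if (!e->valid)
        return false;           /* not a shadow page: the LPA already equals the GA  */
    *ga_node   = e->node_id;    /* no tag compare: only one of the pages mapped to   */
    *ga_page   = e->ga_page;    /*   this entry may ever be allocated as CMR space   */
    *ga_offset = page_offset;   /* the page offset is copied through unchanged       */
    return true;
}
```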

[0123] Turning now to FIG. 11, the organization of a global address tolocal physical address (GA2LPA) table is shown according to oneembodiment of the present invention. Typically, the GA2LPA table of eachnode in a multiprocessing system must include one entry for each page inthe multiprocessing system. For example, in a multiprocessing systemwith four nodes each including 1M pages of local physical address space,the GA2LPA table must include 4M entries. The access time associatedwith a table of that size typically adds unacceptable latency to thetransaction. In one embodiment, the access time of the GA2LPA table isreduced by providing a cache to store the most recently accessed GA2LPAtranslations. The cache is typically backed by memory which stores theentire GA2LPA table. Unfortunately, this solution adds complexity to theGA2LPA table, requires a significant amount of RAM to store the GA2LPAtable, and adds significant latency in the case of a cache miss.

[0124] In an alternative embodiment, the size of GA2LPA table 112 may be reduced by recognizing that only shadow pages need address translations. Multiple global address pages are mapped to each entry in the GA2LPA table. Prior to storing data as a shadow page (i.e., storing data in a COMA manner), GA2LPA table 112 is checked to see if the entry in the table that corresponds to the global address is available. If the entry is available, the global address is translated to a local physical address using a page address from CMR list 804 discussed above in reference to FIG. 7. Alternatively, if the corresponding entry in GA2LPA table 112 is unavailable (i.e., the entry is storing a translation), a shadow copy of the data is not stored and the data is stored in NUMA mode. Accordingly, there is some probability that a node may not be able to store data in COMA mode. This probability may be reduced by expanding the size of GA2LPA table 112 or making GA2LPA table 112 more associative.

[0125] In the embodiment illustrated in FIG. 11, a two-way set associative GA2LPA translation table 112 is shown. Accordingly, two pages associated with one entry of GA2LPA table 112 may be stored as shadow pages. If one way of an entry is occupied, data may still be stored as a shadow page and the conversion entered in the second way of the entry. If both ways contain valid translations, the page may not be stored as a shadow page and is stored in NUMA mode.

[0126] As discussed above, only pages which have a valid translation in the GA2LPA table are converted to shadow pages. If a global address received as part of a request does not have a corresponding translation in the GA2LPA table, then no shadow page exists that corresponds to that global address. Accordingly, no GA2LPA translation is required. Any data that corresponds to the global address on that node is stored in NUMA mode and accordingly the global address may be used to access the data. In other words, the absence of a translation in GA2LPA table 112 provides information to the node that the page has only been stored in NUMA mode on that node.

[0127] As discussed in more detail below, a portion of the global address is used to address an entry of GA2LPA table 112. Because multiple pages are mapped to one entry, a portion of the global address (typically more significant bits than the bits used to address the entry) is compared to address tags stored with each entry. If the bits of the global address match either of the address tags, then GA2LPA table 112 stores a translation for the address and the translation data is used to form a local physical address. Alternatively, if the address tags do not match the bits of the global address, no translation exists for that global address and the global address is used to address any data on the node.
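
A minimal sketch of such a tag-compared, set-associative lookup is given below, assuming a two-way table, a 17-bit index, and a tag formed from the remaining upper page bits; none of these widths are fixed by the description, and the names are illustrative.

```c
/* Sketch of a two-way set-associative GA2LPA lookup in the spirit of FIG. 11. */
#include <stdint.h>
#include <stdbool.h>

#define GA2LPA_SETS 131072u     /* 2^17 sets (assumed) */
#define GA2LPA_WAYS 2u

typedef struct {
    bool     valid;
    uint32_t tag;       /* upper global-page bits: selects which mapped page this is */
    uint32_t lpa_page;  /* local page holding the shadow copy */
} ga2lpa_way_t;

static ga2lpa_way_t ga2lpa[GA2LPA_SETS][GA2LPA_WAYS];

bool ga_to_lpa(uint32_t ga_page, uint32_t *lpa_page)
{
    uint32_t set = ga_page % GA2LPA_SETS;
    uint32_t tag = ga_page / GA2LPA_SETS;
    for (unsigned w = 0; w < GA2LPA_WAYS; w++) {
        if (ga2lpa[set][w].valid && ga2lpa[set][w].tag == tag) {
            *lpa_page = ga2lpa[set][w].lpa_page;   /* shadow page exists locally */
            return true;
        }
    }
    return false;   /* no translation: the node holds the data, if at all, in NUMA mode */
}
```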

[0128] In other embodiments, other organizations for GA2LPA table 112 may be implemented. For example, GA2LPA table 112 may be organized as a four-way set associative table. Generally speaking, increasing the associativity of the table decreases the probability of not being able to store data in COMA mode. For example, if a four-way set associative GA2LPA table is used and the table is twice the size of the corresponding LPA2GA table, the probability of finding available space in the GA2LPA table is 98%, assuming that 75% of the available CMR memory is used. If only 50% of the available CMR space is used, the probability of finding available space in the GA2LPA table is 99.6%.

[0129] One possible organization of a four-way set associative cache is to put the address tags of all four ways in one word. This address tag word is accessed first. If none of the address tags match the bits of the global address, the address does not have a GA2LPA translation and no more accesses to GA2LPA table 112 are required. If one address tag matches the bits of the global address, the way that corresponds to the global address may be determined and the translation information corresponding to that way accessed. Alternatively, the four ways may be sequentially accessed and the address tags compared to the bits of the global address. The same strategies may be used with other table organizations, such as a two-way set associative table.

[0130] Turning now to FIG. 12A, an alternative organization of a GA2LPA table is shown according to one embodiment of the present invention. GA2LPA table 122 is organized as a skewed-associative cache. Index function 124 and index function 126 convert the global address into two different look-up addresses for GA2LPA table 122. The address tags stored in the entries addressed by the look-up addresses are compared to some portion of the global address. The comparison must include enough bits such that the combination of the look-up location and the address tag uniquely identifies one global address. If the address tag of an entry matches the global address, the translation data stored in that entry is used to form the local physical address. Alternatively, if neither address tag matches the global address, the global address is used to access the data. In other words, if neither entry stores a GA2LPA translation for that global address, then the data that corresponds to the global address is stored in NUMA mode within the node and the data is accessed using the global address.

[0131] The look-up address generated by index function 126 for one address may be identical to the look-up address generated by index function 124 for a different address. In one embodiment, the look-up address generated by index function 124 is a subset of the address bits of the global address. In one embodiment, the look-up address generated by index function 126 may be the bit-wise exclusive OR of a plurality of bits within the global address.
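
For illustration, one plausible pair of index functions consistent with this paragraph is sketched below; the particular XOR fold chosen for index function 126, the 17-bit index width, and the names are assumptions.

```c
/* Illustrative index functions for the skewed GA2LPA table of FIG. 12A. */
#include <stdint.h>

#define SKEW_INDEX_BITS 17u
#define SKEW_INDEX_MASK ((1u << SKEW_INDEX_BITS) - 1u)

/* Primary look-up address (index function 124): the low bits of the global page. */
static inline uint32_t index_primary(uint32_t ga_page)
{
    return ga_page & SKEW_INDEX_MASK;
}

/* Secondary look-up address (index function 126): the low bits XORed with upper
 * bits, so two pages that collide under the primary function usually map to
 * different secondary entries. */
static inline uint32_t index_secondary(uint32_t ga_page)
{
    return (ga_page ^ (ga_page >> SKEW_INDEX_BITS)) & SKEW_INDEX_MASK;
}
```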

[0132] Each entry in GA2LPA table 122 includes an address tag field 127, a mode bit 128 and a translation field 129. As discussed above, address tag field 127 stores the address tag of the global address that corresponds to an entry. Mode bit 128 is required to prevent false matches. The mode bit indicates whether the look-up address of the stored translation was derived using index function 124 or index function 126. The address tag and the mode bit must both match in order to select a table entry. Translation field 129 stores the data necessary to generate a local physical address from the global address. Translation field 129 is discussed in more detail below in reference to FIG. 13.

[0133] In one embodiment, when a translation is stored to GA2LPA table 122, an attempt is first made to store the translation data in the entry addressed by index function 124 (referred to herein as the primary entry). If the primary entry is used by another translation, an attempt is then made to store the translation information in the entry addressed by index function 126 (referred to herein as the secondary entry). If the secondary entry is also occupied, no translation is stored for that global address and no shadow page is allocated for that global address.
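
This insertion policy might be expressed as follows; the entry layout mirrors FIG. 12A, the index functions are the assumed ones sketched above, and all names are illustrative.

```c
/* Try the primary entry, fall back to the secondary entry, otherwise report
 * failure so the page is kept in NUMA mode. A sketch only. */
#include <stdint.h>
#include <stdbool.h>

#define IDX_BITS 17u
#define IDX_MASK ((1u << IDX_BITS) - 1u)
static inline uint32_t index_primary(uint32_t g)   { return g & IDX_MASK; }
static inline uint32_t index_secondary(uint32_t g) { return (g ^ (g >> IDX_BITS)) & IDX_MASK; }

typedef struct {
    bool     valid;
    bool     mode;       /* mode bit 128: set when filled via the secondary index */
    uint32_t tag;        /* address tag field 127 */
    uint32_t lpa_page;   /* translation field 129 */
} skew_entry_t;

static skew_entry_t skew_table[1u << IDX_BITS];

bool insert_translation(uint32_t ga_page, uint32_t tag, uint32_t lpa_page)
{
    uint32_t p = index_primary(ga_page);
    uint32_t s = index_secondary(ga_page);
    uint32_t slot;

    if (!skew_table[p].valid)      slot = p;   /* primary entry preferred        */
    else if (!skew_table[s].valid) slot = s;   /* fall back to the secondary one */
    else return false;                         /* both occupied; see realignment below */

    skew_table[slot].valid    = true;
    skew_table[slot].mode     = (slot == s);   /* remember which index function placed it */
    skew_table[slot].tag      = tag;
    skew_table[slot].lpa_page = lpa_page;
    return true;
}
```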

[0134] Turning now to FIG. 12B, a method for maximizing utilization of tables such as GA2LPA table 122 according to one embodiment of the present invention is shown. FIG. 12B illustrates a plurality of entries 132-142 in GA2LPA table 122. Columns 144, 146 and 148 illustrate the look-up addresses for a plurality of global addresses. Each global address has a primary entry and a secondary entry in GA2LPA table 122. In the illustrated embodiment, the primary entry is identified by a “P” next to the entry and the secondary entry is identified by an “S” next to the entry. The entry in which the translation is stored is identified by a circle around the letter identifying the entry. In one embodiment, the primary entry corresponds to the look-up address generated by index function 124 and the secondary entry corresponds to the look-up address generated by index function 126. For example, the primary entry corresponding to global address 1 is entry 132 and the secondary entry corresponding to global address 1 is entry 138. In the illustrated embodiment, the translation for global address 1 is stored in the primary entry, which is entry 132. In a similar manner, the primary entry corresponding to global address 2 is entry 140 and the secondary entry is entry 136. In the illustrated embodiment, the translation for global address 2 is stored in entry 136. The primary entry corresponding to global address 3 is entry 136 and the secondary entry is entry 132.

[0135] In one embodiment, the translation for global address 3 cannot be stored in GA2LPA table 122 because both entries associated with global address 3 are occupied by other translations. The primary entry associated with global address 3 (entry 136) is occupied by global address 2 and the secondary entry (entry 132) is occupied by global address 1. To improve the availability of entries in GA2LPA table 122, the translation for either global address 1 or global address 2 may be moved to the other entry associated with that address. For example, the translation for global address 2 is stored in the secondary entry associated with global address 2 (entry 136). If the translation is moved to the primary entry (entry 140), then entry 136 is available to store the translation for global address 3. Alternatively, the translation for global address 1 could be moved from entry 132 to entry 138, which makes entry 132 available to store the translation for global address 3. In this manner, the utilization of GA2LPA table 122 may be increased.

[0136] The utilization of the table approaches the utilization of a fully associative table while maintaining a relatively simple look-up function. Only two entries need to be accessed during look-up. In other words, from the look-up standpoint, the table is similar to a two-way skewed-associative cache, while its utilization approaches that of a fully associative table. In one embodiment, software performs the realignment function of moving translations between entries to make space available for new translations.

[0137] Turning now to FIG. 12C, another illustration of a method for increasing the utilization of a translation table is shown. In the illustrated embodiment, the primary and secondary entries associated with five global addresses are shown in columns 152-160. The primary entry associated with global address 1 is entry 132 and the secondary entry is entry 138. The translation is stored in entry 132. The primary entry associated with global address 2 is entry 140 and the secondary entry is entry 136. The translation is stored in entry 136. The primary entry associated with global address 3 is entry 138 and the secondary entry is entry 142. The translation is stored in entry 138. The primary entry associated with global address 4 is entry 134 and the secondary entry is entry 140. The translation is stored in secondary entry 140. In a similar manner to that discussed above in reference to FIG. 12B, the translation for global address 5 cannot be stored in GA2LPA table 122 absent a method for improving the utilization of GA2LPA table 122.

[0138] The translation for global address 5 cannot be stored in the table because both the primary and secondary entries associated with global address 5, entries 136 and 132 respectively, are occupied by translations of other global addresses. The translation for global address 1 cannot be moved from entry 132 to entry 138 because entry 138 is currently occupied by the translation of global address 3. Likewise, the translation for global address 2 cannot be moved from entry 136 to entry 140 because entry 140 is occupied by the translation for global address 4. In order to make an entry available in GA2LPA table 122 for the translation of global address 5, the translation for either global address 3 or global address 4 is moved first. This allows the translation of either global address 1 or global address 2 to be moved, which in turn allows the translation of global address 5 to be stored in GA2LPA table 122. For example, the translation for global address 3 may be moved from entry 138 to entry 142. The translation for global address 1 then may be moved from entry 132 to entry 138. The translation for global address 5 then may be stored in entry 132. Alternatively, the translation for global address 4 may be moved from entry 140 to entry 134. The translation of global address 2 may then be moved from entry 136 to entry 140 and the translation for global address 5 stored in entry 136.
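
The realignment of FIGS. 12B and 12C resembles cuckoo-hash insertion, and a compact sketch is given below. To keep the sketch short, each entry stores the full global page number; the disclosed design instead recovers that page from the slot, the address tag and the mode bit (see the sketch after paragraph [0152]). The depth limit and all names are assumptions.

```c
/* Displace occupants along their alternate entries until a slot frees up. */
#include <stdint.h>
#include <stdbool.h>

#define R_BITS  17u
#define R_SIZE  (1u << R_BITS)
#define R_MASK  (R_SIZE - 1u)
#define R_DEPTH 8                /* assumed bound on displacement chains */

typedef struct {
    bool     valid;
    bool     secondary;          /* currently held in its secondary entry */
    uint32_t ga_page;
    uint32_t lpa_page;
} r_entry_t;

static r_entry_t r_tbl[R_SIZE];

static uint32_t r_index(uint32_t ga_page, bool secondary)
{
    return secondary ? ((ga_page ^ (ga_page >> R_BITS)) & R_MASK) : (ga_page & R_MASK);
}

/* Free slot `at` by moving its occupant to that occupant's alternate entry,
 * recursing while the alternate entry is itself occupied. Fails without
 * modifying the table if the depth limit is reached. */
static bool r_make_room(uint32_t at, int depth)
{
    if (!r_tbl[at].valid)
        return true;
    if (depth == 0)
        return false;
    uint32_t alt = r_index(r_tbl[at].ga_page, !r_tbl[at].secondary);
    if (alt == at || !r_make_room(alt, depth - 1))
        return false;
    r_tbl[alt] = r_tbl[at];
    r_tbl[alt].secondary = !r_tbl[at].secondary;   /* it now sits in its other entry */
    r_tbl[at].valid = false;
    return true;
}

bool r_insert(uint32_t ga_page, uint32_t lpa_page)
{
    uint32_t p = r_index(ga_page, false), s = r_index(ga_page, true);
    if (r_make_room(p, R_DEPTH)) {
        r_tbl[p] = (r_entry_t){ true, false, ga_page, lpa_page };
        return true;
    }
    if (r_make_room(s, R_DEPTH)) {
        r_tbl[s] = (r_entry_t){ true, true, ga_page, lpa_page };
        return true;
    }
    return false;   /* realignment failed: the page is stored in NUMA mode instead */
}
```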

[0139] The methodology illustrated in FIGS. 12B and 12C may be repeated for several iterations before an entry becomes available. Although the realignment of the GA2LPA table may be time consuming, the overhead is only incurred once for each new translation. Additionally, the realignment can occur off the critical path of the processor. In one embodiment, only one new translation can be added at a time. Although the method for increasing the utilization of a table is described above in reference to GA2LPA table 122, it is noted that the methodology may be applied to any table that employs skewing or hashing functions.

[0140] Turning now to FIG. 13, a translation of a global address 902 toa local physical address 901 according to one embodiment of the presentinvention is shown. The fields of global address 902 and local physicaladdress 901 are substantially similar to the fields discussed above inreference to FIG. 9. In the illustrated embodiment, page offset field914 from global address 902 is copied to page offset field 903 of localphysical address 901. Because address 901 is a local physical address,node ID field 606 identifies the home node of the local physicaladdress. In one embodiment, CMR bit 604 is asserted in the localphysical address because the local physical address identifies a shadowpage of the page identified by global address 902. LPA page field 904 isobtained from an output of GA2LPA table 122.

[0141] In the illustrated embodiment, the least significant 17 bits of LPA page field 912 are provided to index function 124 and index function 126. The address tags 132 from the two entries that correspond to the look-up addresses output by index function 124 and index function 126 are compared to node ID field 908 and 7 bits of LPA page field 912 by comparator 134. If a match is found, the entry with the matching address tag outputs the translation information to LPA page field 904. In the illustrated embodiment, the two most significant bits of LPA page field 904 are always 0 to reduce the number of bits stored in each entry of GA2LPA table 122.

[0142] In the illustrated embodiment, each entry in GA2LPA table 122 includes an 11-bit address tag, a 22-bit LPA page translation, a mode bit, and two parity bits.

[0143] Turning now to FIG. 14A, a flowchart illustrating the storage ofshadow pages and the allocation of entries within a GA2LPA table isshown. In step 202, portions of memory are allocated as CMR space. Asdiscussed above, CMR space is used to store shadow copies of data fromremote nodes. Step 202 is discussed in more detail below in reference toFIG. 14B. In step 204, a request to store a shadow copy of data (i.e.,store data in COMA mode) is received. As discussed above, shadow pagesare assigned a local physical address such that future accesses to thedata are local rather than global. As discussed in more detail below, anentry within the GA2LPA table must be available before a node will allowa shadow page to be stored.

[0144] In decisional step 206, it is determined whether a GA2LPA table entry that corresponds to the global address of the data to be stored in the shadow page is available. In one embodiment, in order to reduce the number of entries in the GA2LPA table, multiple global addresses are assigned to each entry in the GA2LPA table. In one particular embodiment, the GA2LPA table is a set associative table such that translations of multiple global addresses that correspond to one entry may be stored. If the entry associated with the global address is available, then in step 208, the data is stored to a shadow page in the CMR space and the address of the page is removed from the CMR list. In step 210, the translation data for translating between the global address and the local physical address is stored to the appropriate entries in the GA2LPA table and the LPA2GA table.

[0145] If in decisional step 206 no entry is available in the GA2LPA table, then in step 212, the GA2LPA table may be realigned. Realignment is discussed in more detail below in reference to FIG. 14C. In step 214, it is determined whether the realignment of step 212 was successful (i.e., a table entry corresponding to the address is available). If the realignment was successful, then steps 208 and 210 described above are performed. Alternatively, if the realignment of step 212 was unsuccessful, then in step 216 the data is stored in NUMA mode.

[0146] Turning now to FIG. 14B, a flowchart illustrating the allocation of CMR space according to one embodiment of the present invention is shown. In step 218, the pages of the local memory of a node are mapped to entries in an LPA2GA table. In one embodiment, multiple pages of local memory are mapped to each entry in the LPA2GA table. In one particular embodiment, four pages of local memory are mapped to each entry in the LPA2GA table. Mapping multiple pages of local memory to each entry in the LPA2GA table effectively reduces the size of the LPA2GA table. However, as discussed in more detail below, pages may only be allocated as CMR space if an entry is available in the LPA2GA table for storing the translation for that page.

[0147] In step 220, a page address from a free memory list is retrieved. In one embodiment, the free memory list is a list of addresses of pages which have not been allocated for storage. In the illustrated embodiment, CMR space is allocated by moving page addresses from the free memory list to a CMR list. Accordingly, the CMR list stores page addresses of available pages allocated as CMR space.

[0148] In decisional step 222, it is determined whether the LPA2GA table entry that corresponds to the retrieved page address is available. As discussed above, in one embodiment, multiple page addresses are mapped to one entry in the LPA2GA table. If an entry stores a translation for a page mapped to the same entry, the entry is not available. If the entry is not available, then in step 224, a new page address is retrieved from the free memory list. Steps 222 and 224 are repeated until a page address from the free memory list with an available entry in the LPA2GA table is retrieved.

[0149] In step 226, the retrieved page address with an available LPA2GA table entry is moved from the free memory list to the CMR list. In step 228, the LPA2GA table entry that corresponds to the retrieved page address is marked as unavailable. In an alternative embodiment, steps 226 and 228 may be performed in parallel. In another alternative embodiment, step 228 may be performed before step 226. In one embodiment, a table entry is marked as unavailable by asserting a valid bit within the table entry.
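
Steps 220 through 228 might be sketched as the following loop; the free-list, CMR-list, and table accessors are hypothetical, and the free list is assumed to behave as a queue so that a rejected page is not immediately re-examined.

```c
/* Sketch of the FIG. 14B allocation loop (steps 220-228). */
#include <stdint.h>
#include <stdbool.h>

extern bool     free_list_pop(uint32_t *lpa_page);          /* steps 220 and 224 */
extern void     free_list_push(uint32_t lpa_page);          /* assumed to append to the tail */
extern unsigned free_list_length(void);
extern void     cmr_list_push(uint32_t lpa_page);           /* step 226 */
extern bool     lpa2ga_entry_available(uint32_t lpa_page);  /* step 222 */
extern void     lpa2ga_entry_reserve(uint32_t lpa_page);    /* step 228: set the valid bit */

bool allocate_cmr_page(void)
{
    uint32_t page;
    unsigned remaining = free_list_length();
    while (remaining-- > 0 && free_list_pop(&page)) {
        if (lpa2ga_entry_available(page)) {
            cmr_list_push(page);           /* the page may now hold shadow copies */
            lpa2ga_entry_reserve(page);    /* sibling pages of this entry stay local-only */
            return true;
        }
        free_list_push(page);              /* keep the page as ordinary local memory */
    }
    return false;                          /* no free page with an available entry */
}
```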

[0150] Turning now to FIG. 14C, a flowchart illustrating the realignment of entries in a GA2LPA table according to one embodiment of the present invention is shown. It is noted that the realignment of a GA2LPA table is shown for illustrative purposes only. The same methodology may be used for any table employing primary and secondary entries for an address, such as a table employing skewed associativity or hashing functions. The flowchart contemplates a table in which each address is mapped to a primary entry and a secondary entry. If both the primary entry and the secondary entry of an address are occupied by other translations, the entries are realigned by moving a translation from its primary entry to its secondary entry or from its secondary entry to its primary entry. In this manner, an entry may be made available for storing a new translation. Several iterations of realignment may be required before an entry is made available.

[0151] FIG. 14C contemplates an instance in which both the primary and secondary entries of an address are occupied. In FIG. 14C, a flowchart for the realignment of a GA2LPA table to make the primary entry of a new translation available is shown. It is noted that the same methodology may be used to realign the GA2LPA table such that the secondary entry of the new translation is available. It is contemplated that the realignment to make the primary entry available and the realignment to make the secondary entry available may be performed concurrently. The first entry made available is used for the translation and the realignment is suspended.

[0152] In step 230, the look-up address for the alternate entry of the translation stored in the primary entry is computed. For example, if the translation currently occupying the primary entry resides there as the secondary entry of another global address, the look-up address of the primary entry of that translation is computed. In one embodiment, a mode bit indicates whether an entry corresponds to the primary or secondary entry of the address. The look-up address of the alternate entry may be determined by applying the inverse of the index function used to generate the entry address to obtain the original global address and then applying the other index function. For example, the mode bit may indicate that the translation stored in the primary entry occupies the secondary entry for that translation. The inverse of the index function used to generate the secondary entry look-up address is applied to the entry address, which yields the global address of the translation. The primary index function is then applied to the global address to compute the look-up address of the primary entry of the translation.
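
A sketch of this inversion is shown below, assuming the index functions and tag layout used in the earlier sketches (the tag holds the upper page bits); under those assumptions both functions can be inverted exactly from the slot index and the stored tag.

```c
/* Recover the global page translated by an occupied entry from its slot index,
 * its stored tag and its mode bit, then apply the other index function to get
 * the alternate look-up address (step 230 of FIG. 14C). Illustrative only. */
#include <stdint.h>
#include <stdbool.h>

#define IDX_BITS 17u
#define IDX_MASK ((1u << IDX_BITS) - 1u)

static inline uint32_t idx_primary(uint32_t ga)   { return ga & IDX_MASK; }
static inline uint32_t idx_secondary(uint32_t ga) { return (ga ^ (ga >> IDX_BITS)) & IDX_MASK; }

/* Invert the index function named by `secondary` for an entry at `slot` whose
 * stored tag holds the upper page bits, yielding the full global page number. */
static inline uint32_t recover_ga(uint32_t slot, uint32_t tag, bool secondary)
{
    uint32_t low = secondary ? (slot ^ (tag & IDX_MASK)) : slot;
    return (tag << IDX_BITS) | low;
}

/* The alternate look-up address is the other index function applied to the
 * recovered global page number. */
static inline uint32_t alternate_slot(uint32_t slot, uint32_t tag, bool secondary)
{
    uint32_t ga = recover_ga(slot, tag, secondary);
    return secondary ? idx_primary(ga) : idx_secondary(ga);
}
```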

[0153] In step 232, it is determined whether the alternate entry is available. If the alternate entry is available, then in step 234, the translation stored in the primary entry is moved to its alternate entry. The primary entry is now available to store a new translation.

[0154] Alternatively, if the alternate entry of the translation stored in the primary entry is unavailable, then in step 236, the alternate entry of the alternate entry of the translation stored in the primary entry is computed. It is determined whether this entry is available in decisional step 238. If the entry is available, then in step 240, the translation stored in the alternate entry of the translation stored in the primary entry is moved to its alternate entry. The alternate entry for the translation stored in the primary entry is now available. In step 242, the translation stored in the primary entry is moved to its alternate entry. The primary entry is now available to store the new translation.

[0155] If the entry in decisional step 238 is not available, the look-up address for the alternate entry of the alternate entry of the alternate entry of the translation stored in the primary entry is computed. Steps similar to steps 238-244 are repeated until the table has been realigned to make space available for the new translation or until a predetermined number of iterations has been performed without successfully realigning the table.

[0156] Although SMP nodes 12 have been described in the above exemplaryembodiments, generally speaking an embodiment of computer system 10 mayinclude one or more processing nodes. As used herein, a processing nodeincludes at least one processor and a corresponding memory.Additionally, circuitry for communicating with other processing nodes isincluded. When more than one processing node is included in anembodiment of computer system 10, the corresponding memories within theprocessing nodes form a distributed shared memory. A processing node maybe referred to as remote or local. A processing node is a remoteprocessing node with respect to a particular processor if the processingnode does not include the particular processor. Conversely, theprocessing node which includes the particular processor is thatparticular processor's local processing node.

[0157] Numerous variations and modifications will become apparent tothose skilled in the art once the above disclosure is fully appreciated.It is intended that the following claims be interpreted to embrace allsuch variations and modifications.

What is claimed is:
 1. A look-up table configured to store and output data corresponding to an input address comprising: a plurality of entries for storing said data; and a look-up address circuit configured to receive said input address, wherein said look-up address circuit includes: a first index function circuit configured to convert a first input address to a primary look-up address that corresponds to said first input address, wherein a primary entry of said plurality of entries is addressed by said primary look-up address; and a second index function circuit configured to convert said first input address to a secondary look-up address that corresponds to said first input address, wherein a secondary entry of said plurality of entries is addressed by said secondary look-up address; wherein said look-up table is configured to store a first datum to said primary entry if said primary entry is available and wherein said look-up table is configured to store said first datum to said secondary entry if said primary entry is unavailable; wherein if said primary entry and said secondary entry are unavailable, said look-up table is configured to move a second datum stored in said primary entry to an alternate entry for said second datum and to store said first datum to said primary entry.
 2. The look-up table of claim 1 wherein if said alternate entry for said second datum is unavailable, said look-up table is configured to move a third datum stored in said alternate entry for said second datum to an alternate entry for said third datum, to move said second datum to said alternate entry for said second datum, and to store said first datum to said primary entry.
 3. The look-up table of claim 1 wherein reading said first datum comprises accessing said primary entry and said secondary entry.
 4. The look-up table of claim 1 wherein said primary look-up address is a subset of the bits of said first input address.
 5. The look-up table of claim 2 wherein said secondary look-up address is a bit-wise exclusive-oring of a subset of the bits of said first input address.
 6. The look-up table of claim 1 wherein said input addresses are global addresses and said data are translations of global addresses to local physical addresses.
 7. The look-up table of claim 1 further comprising a realignment unit coupled to said look-up address circuit and said plurality of entries, wherein said realignment unit moves said second datum to said alternate entry by: computing an input address corresponding to said second datum; computing an alternate look-up address of said second datum; and moving said second datum to said alternate entry addressed by said alternate look-up address.
 8. The look-up table of claim 7 wherein said realignment unit computes said input address corresponding to said second datum by determining which index function was used to generate an original look-up address of said entry and applying an inverse of said index function to said input address corresponding to said second datum.
 9. The look-up table of claim 8 wherein said realignment unit computes said alternate look-up address of said second datum by applying the index function that was not used to generate said original look-up address of said entry to said input address corresponding to said second datum.
 10. The look-up table of claim 1 wherein an entry is unavailable if said entry stores a datum that corresponds to another input address.
 11. A look-up table configured to store and output data corresponding to input addresses comprising: a plurality of entries for storing said data; and a look-up address circuit configured to receive said input address, wherein said look-up address circuit includes: a first index function circuit configured to convert a first input address to a primary look-up address that corresponds to said first input address, wherein a primary entry of said plurality of entries is addressed by said primary look-up address; and a second index function circuit configured to convert said first input address to a secondary look-up address that corresponds to said first input address, wherein a secondary entry of said plurality of entries is addressed by said secondary look-up address; wherein said look-up table is configured to store a first datum to said primary entry if said primary entry is available and wherein said look-up table is configured to store said first datum to said secondary entry if said primary entry is unavailable; wherein if said primary entry and said secondary entry are unavailable, said look-up table is configured to move a second datum stored in said secondary entry to an alternate entry for said second datum and to store said first datum to said secondary entry.
 12. A method of storing and retrieving data in a look-up table wherein the data corresponds to input addresses and each input address corresponds to a primary entry and a secondary entry of said look-up table comprising: if a primary entry corresponding to a first input address is available, storing a first datum to said primary entry; if said primary entry is unavailable, storing said first datum to a secondary entry corresponding to said first input address; if said primary entry and said secondary entry are unavailable, moving a second datum stored in said primary entry to an alternate entry of said second datum and storing said first datum to said primary entry.
 13. The method of claim 12 further comprising, if said alternate entry for said second datum is unavailable, moving a third datum stored in said alternate entry for said second datum to an alternate entry of said third datum, moving said second datum to said alternate entry of said second datum, and storing said first datum to said primary entry.
 14. The method of claim 12 further comprising reading said first datum by accessing said primary entry associated with said first input address and said secondary entry associated with said first input address.
 15. The method of claim 12 wherein a first index function generates an address for said primary entry and a second index function generates an address for said secondary entry.
 16. The method of claim 15 wherein said first index function generates a first look-up address which is a subset of the bits of said first input address and said second index function generates a second look-up address which is a bit-wise exclusive-oring of a subset of the bits of said first input address.
 17. The method of claim 12 wherein said moving said second datum stored in said primary entry to said alternate entry comprises: computing an input address corresponding to said second datum; computing an alternate look-up address for said second datum; and moving said second datum to said alternate entry addressed by said alternate look-up address.
 18. The method of claim 17 wherein said computing of said input address corresponding to said second datum comprises determining which index function was used to generate an original look-up address of said entry and applying an inverse of said index function to said input address corresponding to said second datum.
 19. The method of claim 18 wherein computing said alternate look-up address for said second datum comprises applying the index function that was not used to generate said original look-up address of said entry to said input address corresponding to said second datum.
 20. The method of claim 12 wherein an entry is unavailable if said entry stores a datum that corresponds to another input address.
 21. A method of storing and retrieving data in a look-up table wherein the data corresponds to input addresses and each input address corresponds to a primary entry and a secondary entry of said look-up table comprising: if a primary entry corresponding to a first input address is available, storing a first datum to said primary entry; if said primary entry is unavailable, storing said first datum to a secondary entry corresponding to said first input address; if said primary entry and said secondary entry are unavailable, moving a second datum stored in said secondary entry to an alternate entry of said second datum and storing said first datum to said secondary entry.